hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Gray <jg...@facebook.com>
Subject RE: How to do "Group By" in HBase
Date Thu, 01 Apr 2010 16:34:17 GMT
For 1/2, it seems that your row key design is ideal for those queries.  You say it's inefficient
because you need to scan the "whole session of data" containing hammer... but wouldn't you
always have to do that unless you were doing some kind of summary/rollups?  Even in a relational
database you scan lots of stuff!

I can't say I fully understand query #3.  You select a single product and single store, but
then group by product and store?  Is that group by required if you just want a single SUM()
result?  Sorry my SQL is rusty :)

If you're really just after counts of stuff, you could potentially keep counters rather than
scanning all the original source data.  Check out HTable.incrementColumnValue().  You could
have a table with similar rows but that were rolled up on different dimensions.

For example, if you wanted to roll-up counts for each product by each day, you could have
row=product:day and then could have a single column or column for each store.  In that column
you would use incrementColumnValue to increment for each num().  Then you could do a single
row query to get all of the counts for that product for that day, by store.  If you wanted
for a range of days, you could scan the days you need, but this would be more efficient than
scanning the actual orders since each cell is just a counter.

Make sense?


> -----Original Message-----
> From: Sean [mailto:seanatpurdue@hotmail.com]
> Sent: Thursday, April 01, 2010 1:46 AM
> To: hbase-user@hadoop.apache.org
> Subject: How to do "Group By" in HBase
> I have the follow kind of data (a typical store sell record):
> {product, date, store_name} --> number
> I understand that if I choose the following row key design, I will be
> able to quickly GROUP BY store_name.
> row key -- product:date:store_name
> column name -- number
> In other words, I can efficiently achieve the following logic (just a
> HBase scan) -- adjacent scan.
> 1) SELECT SUM(num) FROM sale_history_table where product="hammer" GROUP
> BY product
> 2) SELECT SUM(num) FROM sale_history_table where product="hammer",
> date="12/04/2009" GROUP BY product date
> However, it's very inefficient to do the following thing because to
> achieve this, I basically need to scan the whole session of data that
> containing "hammer"
> 3) SELECT SUM(num) FROM sale_history_table where product="hammer",
> store_name="SFO_AIRPORT" GROUP BY product store_name
> Can someone give me an advice on what I should design my HBase schema
> if I choose to use native Hbase (I am thinking a second table may help
> case 3, but have not come up with an idea)?
> ( I understand Zohmg is good at these kind of problem, but I'd rather
> choose it as the last resort)
> Thanks,
> Sean
> The New Busy is not the old busy. Search, chat and e-mail from your
> inbox. Get started.
> _________________________________________________________________
> Hotmail is redefining busy with tools for the New Busy. Get more from
> your inbox.
> http://www.windowslive.com/campaign/thenewbusy?ocid=PID27925::T:WLMTAGL
> :ON:WL:en-US:WM_HMP:032010_2

View raw message