hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Meil <doug.m...@explorysmedical.com>
Subject RE: How to efficiently join HBase tables?
Date Tue, 31 May 2011 15:42:27 GMT
Eran's observation was that a join is solvable in a Mapper via lookups on a 2nd HBase table,
but it might not be that efficient if the lookups are 1 by 1.  I agree with that.

My suggestion was to use multi-Get for the lookups instead.  So you'd hold onto a batch of
records in the Mapper and then the batch size is filled, then you do the lookups (and then
any required emitting, etc.).  



-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Tuesday, May 31, 2011 10:56 AM
To: user@hbase.apache.org
Subject: RE: How to efficiently join HBase tables?


Maybe I'm missing something... but this isn't a hard problem to solve.

Eran wants to join two tables.
If we look at an SQL Statement...

SELECT A.*, B.*
FROM A, B
WHERE A.1 = B.1
AND  A.2 = B.2
AND  A.3 = xxx
AND A.4 = yyy
AND B.45 = zzz

Or something along those lines.

So what you're essentially doing is saying I want to take a subset of data from table A, and
a subset of data from table B and join them on the values in columns 1 and 2.
Table A's data will be filtered on columns 3 and 4 and B's data will be filtered on column
45. NOTE: since you don't know the relationship of the column names to either table, you're
safer in writing tableA|column_name and tableB|column_name to your temp table.

So if you create a temp table FOO where the key is column 1 and column 2 (column1|column2)
then when you walk through the subsets adding them to the temp table, you will get the end
result automatically.

Then you can output your hbase temp table and then truncate the table.

So what am I missing?

-Mike


> From: doug.meil@explorysmedical.com
> To: user@hbase.apache.org
> Date: Tue, 31 May 2011 10:22:34 -0400
> Subject: RE: How to efficiently join HBase tables?
> 
> 
> Re:  "The problem is that the few references to that question I found recommend pulling
one table to the mapper and then do a lookup for the referred row in the second table."
> 
> With multi-get in .90.x you could perform some reasonably clever processing and not do
the lookups one-by-one but in batches.
> 
> Also, if the other table is "small" you could have the leverage the block cache on the
lookups (i.e., if it's a domain/lookup table).  
> 
> 
> 
> -----Original Message-----
> From: eran@gigya-inc.com [mailto:eran@gigya-inc.com] On Behalf Of Eran 
> Kutner
> Sent: Tuesday, May 31, 2011 8:06 AM
> To: user@hbase.apache.org
> Subject: How to efficiently join HBase tables?
> 
> Hi,
> I need to join two HBase tables. The obvious way is to use a M/R job for that. The problem
is that the few references to that question I found recommend pulling one table to the mapper
and then do a lookup for the referred row in the second table.
> This sounds like a very inefficient way to do  join with map reduce. I believe it would
be much better to feed the rows of both tables to the mapper and let it emit a key based on
the join fields. Since all the rows with the same join fields values will have the same key
the reducer will be able to easily generate the result of the join.
> The problem with this is that I couldn't find a way to feed two tables to a single map
reduce job. I could probably dump the tables to files in a single directory and then run the
join on the files but that really makes no sense.
> 
> Am I missing something? Any other ideas?
> 
> -eran
 		 	   		  

Mime
View raw message