hbase-user mailing list archives

From Eran Kutner <e...@gigya.com>
Subject Re: How to efficiently join HBase tables?
Date Tue, 31 May 2011 12:43:43 GMT
MultipleInputs would be ideal, but that seems pretty complicated.
A MultiTableInputFormat seems like a simple change to the getSplits() method
of TableInputFormat, plus support for a collection of tables and their
matching scanners instead of a single table and scanner. That doesn't sound
too hard.
Any other suggestions?
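
To make the getSplits() idea concrete, here is a minimal sketch in plain Java. It uses no HBase classes: Split, splitsForTable() and the region start-key lists are hypothetical stand-ins, and the only behaviour assumed from HBase is that TableInputFormat produces one split per region and that an empty key means "open-ended". The multi-table version simply concatenates the per-table splits and tags each one with its table name, so a record reader would know which table to open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these types are real HBase classes.
class MultiTableSplitsSketch {

    // Stand-in for a table split: the table it belongs to plus a row-key range.
    static final class Split {
        final String table;
        final String startRow; // inclusive; "" means start of table
        final String endRow;   // exclusive; "" means end of table

        Split(String table, String startRow, String endRow) {
            this.table = table;
            this.startRow = startRow;
            this.endRow = endRow;
        }

        @Override
        public String toString() {
            return table + "[" + startRow + "," + endRow + ")";
        }
    }

    // Single-table behaviour: one split per region, given sorted region start keys.
    static List<Split> splitsForTable(String table, List<String> regionStartKeys) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String start = regionStartKeys.get(i);
            String end = (i + 1 < regionStartKeys.size()) ? regionStartKeys.get(i + 1) : "";
            splits.add(new Split(table, start, end));
        }
        return splits;
    }

    // Multi-table behaviour: concatenate the splits of every configured table.
    static List<Split> getSplits(Map<String, List<String>> regionStartKeysByTable) {
        List<Split> all = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : regionStartKeysByTable.entrySet()) {
            all.addAll(splitsForTable(e.getKey(), e.getValue()));
        }
        return all;
    }

    public static void main(String[] args) {
        // Table names and region boundaries here are made up for illustration.
        Map<String, List<String>> tables = Map.of(
                "users", List.of("", "m"),
                "orders", List.of(""));
        getSplits(tables).forEach(System.out::println);
    }
}
```

A real implementation would also need to carry a per-table scanner inside each split; that part is left out here.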


On Tue, May 31, 2011 at 15:31, Ferdy Galema <ferdy.galema@kalooga.com> wrote:

> As far as I can tell there is not yet a built-in mechanism you can use for
> this. You could implement your own InputFormat, something like
> MultiTableInputFormat. If you need different map functions for the two
> tables, perhaps something similar to Hadoop's MultipleInputs should do the
> trick.
> On 05/31/2011 02:06 PM, Eran Kutner wrote:
>> Hi,
>> I need to join two HBase tables. The obvious way is to use a M/R job for
>> that. The problem is that the few references to that question I found
>> recommend pulling one table to the mapper and then do a lookup for the
>> referred row in the second table.
>> This sounds like a very inefficient way to do a join with map reduce. I
>> believe it would be much better to feed the rows of both tables to the
>> mapper and let it emit a key based on the join fields. Since all the rows
>> with the same join field values will have the same key, the reducer will
>> be able to easily generate the result of the join.
>> The problem with this is that I couldn't find a way to feed two tables to
>> a single map reduce job. I could probably dump the tables to files in a
>> single directory and then run the join on the files but that really makes
>> no sense.
>> Am I missing something? Any other ideas?
>> -eran
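
The reduce-side join described in the quoted message can be sketched in plain Java as follows. This is only a simulation of the map, shuffle and reduce steps under stated assumptions: it uses no Hadoop or HBase classes, each row is reduced to a (join key, payload) pair, and the table names and sample data are made up. The mapper tags every row with its source table and emits it under the join key; the reducer then pairs the rows from the two tables that share a key (an inner join).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; not a real MapReduce job.
class ReduceSideJoinSketch {

    // "Map" phase: emit (joinKey -> (tableName, payload)) for every row of one table.
    static void map(String table, List<Map.Entry<String, String>> rows,
                    Map<String, List<Map.Entry<String, String>>> shuffle) {
        for (Map.Entry<String, String> row : rows) {
            shuffle.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                   .add(Map.entry(table, row.getValue()));
        }
    }

    // "Reduce" phase: for one join key, pair every left row with every right row.
    static List<String> reduce(String key, List<Map.Entry<String, String>> tagged,
                               String leftTable, String rightTable) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Map.Entry<String, String> t : tagged) {
            if (t.getKey().equals(leftTable)) {
                left.add(t.getValue());
            } else if (t.getKey().equals(rightTable)) {
                right.add(t.getValue());
            }
        }
        List<String> joined = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                joined.add(key + "\t" + l + "\t" + r);
            }
        }
        return joined;
    }

    // Runs both phases; the TreeMap stands in for the sorted shuffle.
    static List<String> join(String leftTable, List<Map.Entry<String, String>> leftRows,
                             String rightTable, List<Map.Entry<String, String>> rightRows) {
        Map<String, List<Map.Entry<String, String>>> shuffle = new TreeMap<>();
        map(leftTable, leftRows, shuffle);
        map(rightTable, rightRows, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Map.Entry<String, String>>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue(), leftTable, rightTable));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> users =
                List.of(Map.entry("u1", "alice"), Map.entry("u2", "bob"));
        List<Map.Entry<String, String>> orders =
                List.of(Map.entry("u1", "book"), Map.entry("u1", "pen"), Map.entry("u3", "lamp"));
        join("users", users, "orders", orders).forEach(System.out::println);
    }
}
```

Keys that appear in only one table ("u2" and "u3" above) produce no output, which is the inner-join behaviour. In a real job the reducer would buffer one side in memory, so the smaller table should be the buffered one.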
