hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: How to efficiently join HBase tables?
Date Tue, 31 May 2011 14:20:33 GMT


You want to join two tables? The short answer is to use a relational database to solve that problem.

Longer answer:

You're using HBase, so you don't need to think in terms of a reducer.
You can create a temp table for your query, with rows keyed on the join field.
You can then run one map job to scan and filter table A, dumping the result set into the
temp table.
In parallel, you run a map job to scan and filter table B, dumping the result set into the
same temp table. Rows from A and B that share a join value end up in the same temp row.

Voila! You're done. Just remember to clean up and drop the temp table when you're done.
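
The idea above can be sketched without a cluster. This is only a minimal in-memory illustration, not HBase API code: plain dicts stand in for tables, the helper scan_filter_into and the sample columns (user_id, name, city) are hypothetical, and it assumes the temp table's row key is the join field.

```python
from collections import defaultdict


def scan_filter_into(temp_table, source_rows, join_field, keep):
    """Stand-in for one map-only job: scan a source table, filter rows,
    and write the survivors into the temp table keyed by the join field."""
    for row in source_rows:
        if not keep(row):
            continue
        key = row[join_field]
        for column, value in row.items():
            if column != join_field:
                temp_table[key][column] = value


# Hypothetical sample data; each dict is one row of a source table.
table_a = [
    {"user_id": "u1", "name": "Ann"},
    {"user_id": "u2", "name": "Bob"},
]
table_b = [
    {"user_id": "u1", "city": "Oslo"},
    {"user_id": "u3", "city": "Rome"},
]

# The temp table: row key -> {column: value}.
temp_table = defaultdict(dict)

# The two "map jobs". They are independent, so they could run in parallel.
scan_filter_into(temp_table, table_a, "user_id", lambda row: True)
scan_filter_into(temp_table, table_b, "user_id", lambda row: True)

# Rows seen in only one source table are still present in the temp table,
# so an inner join needs one more pass that keeps rows with both sides.
joined = {
    key: columns
    for key, columns in temp_table.items()
    if "name" in columns and "city" in columns
}
```

Note that the temp table by itself behaves like a full outer join. If you only want rows present in both tables, you need the extra filter shown at the end.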

But there may be a problem if both tables use the same column name for data that means
different things. Say both tables have a column named 'Tim' (and why you would name
something Tim is beyond me... ;-) ), but this column means one thing in table A and
something else in table B, and you want to retain both values. You just need to create a
column whose name is based on ${tablename}+'|'+${columnname}, so it would be TableA|Tim
and TableB|Tim.
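
A small sketch of that naming scheme, again in plain Python rather than HBase API code. The helper name qualified_column and the sample values are hypothetical.

```python
def qualified_column(table_name, column_name):
    """Build a temp-table column name of the form <table>|<column>."""
    return f"{table_name}|{column_name}"


# One row of the temp table, receiving the 'Tim' column from both sources.
temp_row = {}
temp_row[qualified_column("TableA", "Tim")] = "value from table A"
temp_row[qualified_column("TableB", "Tim")] = "value from table B"

# Both values survive because the column names no longer collide.
for name, value in sorted(temp_row.items()):
    print(name, "=", value)
```

This only works cleanly if '|' never appears in a table name, otherwise the prefix becomes ambiguous.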



> From: eran@gigya.com
> Date: Tue, 31 May 2011 15:43:43 +0300
> Subject: Re: How to efficiently join HBase tables?
> To: ferdy.galema@kalooga.com
> CC: user@hbase.apache.org
> MultipleInputs would be ideal, but that seems pretty complicated.
> MultiTableInputFormat seems like a simple change in the getSplits() method
> of TableInputFormat, plus support for a collection of tables and their matching
> scanners instead of a single table and scanner. That doesn't sound too
> complicated.
> Any other suggestions?
> -eran
> On Tue, May 31, 2011 at 15:31, Ferdy Galema <ferdy.galema@kalooga.com> wrote:
> > As far as I can tell there is not yet a built-in mechanism you can use for
> > this. You could implement your own InputFormat, something like
> > MultiTableInputFormat. If you need different map functions for the two
> > tables, perhaps something similar to Hadoop's MultipleInputs should do the
> > trick.
> >
> >
> > On 05/31/2011 02:06 PM, Eran Kutner wrote:
> >
> >> Hi,
> >> I need to join two HBase tables. The obvious way is to use a M/R job for
> >> that. The problem is that the few references to that question I found
> >> recommend pulling one table to the mapper and then doing a lookup for the
> >> referred row in the second table.
> >> This sounds like a very inefficient way to do a join with map reduce. I
> >> believe it would be much better to feed the rows of both tables to the
> >> mapper and let it emit a key based on the join fields. Since all the rows
> >> with the same join field values will have the same key, the reducer will be
> >> able to easily generate the result of the join.
> >> The problem with this is that I couldn't find a way to feed two tables to a
> >> single map reduce job. I could probably dump the tables to files in a single
> >> directory and then run the join on the files, but that really makes no sense.
> >>
> >> Am I missing something? Any other ideas?
> >>
> >> -eran
> >>
> >>