hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eran Kutner <e...@gigya.com>
Subject Re: How to efficiently join HBase tables?
Date Tue, 31 May 2011 18:42:58 GMT
Thanks everyone for the great feedback. I'll try to address all the
suggestions.

My data sets go between large and very large. One is in the order of many
billions of rows, although the input for a typical MR job will be in the
hundreds of millions, the second table is in the tens of millions. I doubt a
SQL DB will handle this kind of a join in a reasonable manner.

Doing batched lookup will indeed be more efficient than one by one but it
will require the mapper to manage a local state between multiple calls,
which is something I don't really like doing, and worse, it doesn't really
solve the lookup problem it only moves it one tier lower. Instead of the
mapper having to do all thouse random lookups, now HBase itself will have to
do them. Granted it is more efficient than individual lookup API calls but
it is not nearly as efficient as doing sequential reads.

Finally, the temp table method, that will work but again, I suspect it will
be a lot less efficient than the sequence files Hadoop would generate. The
join output is expected to be in the tens of millions of rows, each with
multiple columns. From some tests I've done, writing this number of rows to
a clean table starts out very slowly and takes a lot of time to ramp up as
the regions begin to split and move around the cluster. I should say that
the output of this join is just the input for another MR job, so it would
really be just a temp table and not something that would be useful after
that.

I should also say that I have looked into eliminating the lookup altogether
by resolving the data from the second table before the rows are inserted to
the main table, kind of denormalization, but that would introduce an
unacceptable latency to a very high volume process.

Still looking for other ideas.

-eran



On Tue, May 31, 2011 at 18:42, Doug Meil <doug.meil@explorysmedical.com>wrote:

> Eran's observation was that a join is solvable in a Mapper via lookups on a
> 2nd HBase table, but it might not be that efficient if the lookups are 1 by
> 1.  I agree with that.
>
> My suggestion was to use multi-Get for the lookups instead.  So you'd hold
> onto a batch of records in the Mapper and then the batch size is filled,
> then you do the lookups (and then any required emitting, etc.).
>
>
>
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Tuesday, May 31, 2011 10:56 AM
> To: user@hbase.apache.org
> Subject: RE: How to efficiently join HBase tables?
>
>
> Maybe I'm missing something... but this isn't a hard problem to solve.
>
> Eran wants to join two tables.
> If we look at an SQL Statement...
>
> SELECT A.*, B.*
> FROM A, B
> WHERE A.1 = B.1
> AND  A.2 = B.2
> AND  A.3 = xxx
> AND A.4 = yyy
> AND B.45 = zzz
>
> Or something along those lines.
>
> So what you're essentially doing is saying I want to take a subset of data
> from table A, and a subset of data from table B and join them on the values
> in columns 1 and 2.
> Table A's data will be filtered on columns 3 and 4 and B's data will be
> filtered on column 45. NOTE: since you don't know the relationship of the
> column names to either table, you're safer in writing tableA|column_name and
> tableB|column_name to your temp table.
>
> So if you create a temp table FOO where the key is column 1 and column 2
> (column1|column2) then when you walk through the subsets adding them to the
> temp table, you will get the end result automatically.
>
> Then you can output your hbase temp table and then truncate the table.
>
> So what am I missing?
>
> -Mike
>
>
> > From: doug.meil@explorysmedical.com
> > To: user@hbase.apache.org
> > Date: Tue, 31 May 2011 10:22:34 -0400
> > Subject: RE: How to efficiently join HBase tables?
> >
> >
> > Re:  "The problem is that the few references to that question I found
> recommend pulling one table to the mapper and then do a lookup for the
> referred row in the second table."
> >
> > With multi-get in .90.x you could perform some reasonably clever
> processing and not do the lookups one-by-one but in batches.
> >
> > Also, if the other table is "small" you could have the leverage the block
> cache on the lookups (i.e., if it's a domain/lookup table).
> >
> >
> >
> > -----Original Message-----
> > From: eran@gigya-inc.com [mailto:eran@gigya-inc.com] On Behalf Of Eran
> > Kutner
> > Sent: Tuesday, May 31, 2011 8:06 AM
> > To: user@hbase.apache.org
> > Subject: How to efficiently join HBase tables?
> >
> > Hi,
> > I need to join two HBase tables. The obvious way is to use a M/R job for
> that. The problem is that the few references to that question I found
> recommend pulling one table to the mapper and then do a lookup for the
> referred row in the second table.
> > This sounds like a very inefficient way to do  join with map reduce. I
> believe it would be much better to feed the rows of both tables to the
> mapper and let it emit a key based on the join fields. Since all the rows
> with the same join fields values will have the same key the reducer will be
> able to easily generate the result of the join.
> > The problem with this is that I couldn't find a way to feed two tables to
> a single map reduce job. I could probably dump the tables to files in a
> single directory and then run the join on the files but that really makes no
> sense.
> >
> > Am I missing something? Any other ideas?
> >
> > -eran
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message