hbase-user mailing list archives

From eric_...@yahoo.com
Subject Re: using composite index for a duplicate check validation
Date Wed, 23 Feb 2011 07:46:38 GMT
Thanks a lot Jean-Daniel,

I will try disabling the cache to see if I get a performance improvement.  I was 
not aware of the parallel scan.  I will look into that.

From: Jean-Daniel Cryans <jdcryans@apache.org>
To: user@hbase.apache.org
Sent: Wed, February 23, 2011 1:20:59 AM
Subject: Re: using composite index for a duplicate check validation

A Get is a random read, so expect it to be slower than let's say a
scanner or a random insert (the other calls that are made in your
code). Unless you are able to keep all that data in the block cache of
the region servers, those calls are going to be expensive.

A change that would be very easy to make is to disable caching of the
data read by the scans (done by the maps) using setCacheBlocks(false)
on the Scan object you're passing to TableMapReduceUtil. This will make
it so that the scanning won't trash the block cache; my understanding
of your use case is that the reads will all be done only once, so
caching them wouldn't really help anyway.
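The suggested change can be sketched as follows. In the HBase client API, Scan.setCacheBlocks(false) is the call that keeps scanned blocks out of the region servers' block cache (Scan.setCaching(int) takes a row count per RPC, not a boolean). This is a configuration fragment, not a complete program; the caching value of 500 is an arbitrary example:

```java
import org.apache.hadoop.hbase.client.Scan;

// Scan handed to TableMapReduceUtil for the validation job
Scan scan = new Scan();
scan.setCacheBlocks(false); // scanned blocks bypass the region server block cache
scan.setCaching(500);       // rows fetched per RPC; tune to your row size
```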

Maybe you could also consider running a parallel scan on the index
table instead of issuing random Gets against it, but that depends on
how the index was constructed. Maybe you can come up with a scan that
makes more sense.
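Since the index row keys sort as itemId_memberId_invoiceDate, a contiguous range of them can be covered with one scanner instead of one Get per line. A rough sketch of that alternative, assuming an open indexTable HTable (startKey and stopKey are hypothetical range bounds):

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// One scanner over a key range of entity_dup_index, instead of random Gets
Scan scan = new Scan(Bytes.toBytes(startKey), Bytes.toBytes(stopKey));
scan.setCacheBlocks(false); // one-time read; don't pollute the block cache
ResultScanner scanner = indexTable.getScanner(scan);
try {
    for (Result r : scanner) {
        // join each index row back to its entity lines here
    }
} finally {
    scanner.close();
}
```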

Hope that helps,


On Mon, Feb 21, 2011 at 12:34 AM,  <eric_bdr@yahoo.com> wrote:
> Hi,
> I am currently building a multi-tenant ERP-like application that must handle
> billions of transaction lines.  I am using HBase 0.90 and wrote an end-to-end
> initial POC to test the performance characteristics.  Here is my end-to-end
> case:
> 1. load a sizable transaction submission file (csv).  The file has about 150
> attributes and is typically about 2,000,000 lines long
> 2. validate the file (resolve a bunch of attributes such as product, members,
> dates, amounts, against an effective-dated base data set)
> 3. flag each line as a duplicate of {list of line ids} if it has the same
> values for a user-defined set of columns.
> 4. run some calculations against these lines filtered by some user-defined
> filters
> I have a cluster of 10 basic machines (1 used as master and 9 as slaves).
> so here is what I am doing (I have about 10,000,000 lines in the entity_table
> at this point):
> 1. load file in HBase table called 'entity_table' using a mapper, passing it a
> file format definition object that understands how to parse the file
> 2. index the file by creating a new table called 'entity_dup_index' where the
> row key = entity.itemId+"_"+entity.memberId+"_"+entity.invoiceDate, with a
> column family into which I add the entity.key as both qualifier and value.
> 3. run the validation step:  the dup check will call
> entity_dup_index.get(index_key) and loop over all key/value pairs, removing
> all keys that are <= entity.key to ensure that a line is flagged as a dup only
> of previously loaded lines.
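The key scheme in step 2 and the "previously loaded only" filter in step 3 can be sketched in plain Java. The field names (itemId, memberId, invoiceDate) come from the post; treating entity keys as longs that increase in load order is an assumption made for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class DupCheckSketch {

    // Step 2: composite row key for entity_dup_index
    static String compositeKey(String itemId, String memberId, String invoiceDate) {
        return itemId + "_" + memberId + "_" + invoiceDate;
    }

    // Step 3: keep only index entries whose entity key is strictly smaller
    // than the current line's key, i.e. lines loaded before this one
    static List<Long> previousDuplicates(List<Long> indexedKeys, long currentKey) {
        List<Long> dups = new ArrayList<Long>();
        for (long k : indexedKeys) {
            if (k < currentKey) {
                dups.add(k);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        System.out.println(compositeKey("item1", "m42", "2011-02-21")); // item1_m42_2011-02-21
        System.out.println(previousDuplicates(List.of(1L, 5L, 9L), 5L)); // [1]
    }
}
```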
> The question that I have is about performance:
> 1. Loading a 2,000,000 line file into HBase takes about 15 mins
> 2. indexing 2,000,000 lines takes about 3 mins (indexing 6,000,000 takes about 
> mins)
> 3. running the dup check on 2,000,000 takes over 1 hour.
> public void map(ImmutableBytesWritable row, Result result, Context context)
>     throws IOException, InterruptedException {
>   // reset previous validation results for this row
>   Delete delete = new Delete(row.get());
>   delete.deleteFamily(valFam.getBytes());
>   Put put = new Put(row.get());
>   // ***********************************
>   String key = getCompositeIndexKey(result);
>   HTable indexTable; // initialized at setConf()
>   Get get = new Get(key.getBytes());
>   Result rr = indexTable.get(get);
>   // loop over all KeyValues of rr
>   put.add(valFam.getBytes(), ...);
>   // ***********************************
>   context.write(tableName, delete);
>   context.write(tableName, put);
> }
> the indexTable.get(get) call is the culprit!  When I comment out this code,
> validation runs in under 15 mins.  Would you have some idea on how I could
> speed up the composite index lookup or structure my algorithm differently to
> get better performance?
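One possible mitigation for the per-line Get cost, if your HBase version supports multi-get (HTable.get(List<Get>)): buffer the index lookups in the mapper and issue them in batches, paying one RPC round-trip per batch instead of per line. A sketch under that assumption; BATCH, pending, and lookup() are hypothetical names, and indexTable is the index HTable from the post:

```java
// Inside the mapper: buffer lookups and flush every BATCH rows
private final List<Get> pending = new ArrayList<Get>();
private static final int BATCH = 100;

void lookup(byte[] indexKey) throws IOException {
    pending.add(new Get(indexKey));
    if (pending.size() >= BATCH) {
        Result[] results = indexTable.get(pending); // one RPC for the whole batch
        // ... match each Result back to its buffered row, emit Put/Delete ...
        pending.clear();
    }
}
```

Any rows still buffered when the mapper finishes would need a final flush in cleanup().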
> Thanks a lot for your help,
> -Eric
