hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: hbase bulk load / table split
Date Tue, 04 Jan 2011 17:09:15 GMT


Marc,

Just an idea. 
Can you create your table with N regions, with null as the start key and the largest possible
key value as your end key?
I was talking with ssechrist on IRC and he pointed me to this API:
http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable%28org.apache.hadoop.hbase.HTableDescriptor,%20byte[],%20byte[],%20int%29

And looking at the createTable() methods....
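
For example, here is a minimal sketch against that createTable() overload (the table
name and the "metrics" family come from the paths in your message -- add any other
families you have; the key range and region count are placeholders you'd replace with
values that bracket your real row keys):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("metrics"));

    // createTable(desc, startKey, endKey, numRegions) pre-creates numRegions
    // regions with split points spread evenly between startKey and endKey,
    // so the bulk load has many regions (and hence reducers) from the start.
    // The keys below are placeholders -- use values that bracket your keys.
    byte[] startKey = Bytes.toBytes("a");
    byte[] endKey   = Bytes.toBytes("z");
    admin.createTable(desc, startKey, endKey, 32);
  }
}

With roughly 20 GB of input, even a few dozen regions would bring each one down to
well under a GB, instead of the single ~19 GB region you have now.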

HTH

-Mike

> From: mslimotte@gmail.com
> Date: Tue, 4 Jan 2011 06:03:21 -0800
> Subject: hbase bulk load / table split
> To: hbase-user@hadoop.apache.org
> 
> I've made some good progress using the HBase Bulk Load Tool with HBase
> 0.89.20100924+28.
> 
> My initial implementation did not have importtsv do compression, and it ran
> directly on the HBase cluster's Hadoop.  It has been working OK for a while
> (but slowly, because of limited resources).
> 
> My next implementation, as discussed in another thread, has compression
> settings turned on for importtsv (thanks, Lars).  And I am running the
> importtsv on a remote cluster and then distcp'ing (thanks, Todd) the results
> to the HBase cluster for the completebulkload step.
> 
> I'm trying this out with a fresh (empty) HBase table.  The first run of
> importtsv takes a very long time: because the table has only one region,
> it starts only one reducer.
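>
> (The one-reducer behavior comes, I believe, from
> HFileOutputFormat.configureIncrementalLoad(), which importtsv uses when writing
> HFiles for bulk load.  Roughly, as a sketch with placeholder class and job names,
> not the actual importtsv source:)
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
> import org.apache.hadoop.mapreduce.Job;
>
> public class BulkLoadJobSetup {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     Job job = new Job(conf, "importtsv-mytable");
>     HTable table = new HTable(conf, "mytable");
>     // Wires in HFileOutputFormat plus the TotalOrderPartitioner, and sets
>     // the number of reduce tasks to the table's current region count --
>     // a single-region table therefore gets exactly one reducer.
>     HFileOutputFormat.configureIncrementalLoad(job, table);
>   }
> }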
> 
>    - Bulk load into a new table
>    - About 20 GB of data (compressed with gzip)
>    - Created one massive region
> 
> It seemed to complete successfully.  But we are seeing some intermittent
> errors (missing blocks and such).
> 
> > Could not obtain block: blk_-5944324410280250477_429443
> > file=/hbase/mytable/7c2b09e1ef8c4984732f362d7876305c/metrics/7947729174003011436
> 
> The initial region seems to have split once, but I'm not sure the split
> completed, since the key ranges overlap and the storeFileSizeMB seems to be
> about as big as it started out.  My theory is that the initial load is too
> large for a region, and the split either failed or is still in progress.
> 
> Both on the same Region Server:
>
> > mytable,ad_format728x90site_category2advertiser14563countrysepublisher2e03ab73-b234-4413-bcee-6183a99bd840starttime1293897600,1294094158507.2360f0a03e2566c72ea1a07c40f5f296.
> > stores=2, storefiles=1075, storefileSizeMB=19230, memstoreSizeMB=0, storefileIndexSizeMB=784
> > --
> > mytable,,1294094158507.33b1e47c5fb004aa801b0bd88ce8322d.
> > stores=2, storefiles=1083, storefileSizeMB=19546, memstoreSizeMB=0, storefileIndexSizeMB=796
> 
> Another new table on this same HBase, loaded around the same time, has
> already split into 69 regions (storefileSizeMB 200-400 each).  This one
> was loaded in smaller chunks, with importtsv running directly on the HBase
> cluster, but also with compression on.
> 
> Now that I've gotten all the background down, here are my questions:
> 
>    1. Is it still working on the split?  Any way to monitor progress?
>    2. Can I force more splits?
>    3. Should I have done something first to avoid having the bulk load
>    create one big region?
>    4. Would it be easier to split if my initial bulk load were not gzip
>    compressed?
>    5. Am I looking in the wrong place entirely for this issue?
> 
> thanks,
> Marc