hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: NotServingRegionException - Map/Reduce process fails
Date Thu, 23 Oct 2008 19:20:05 GMT
Dru: If compactions are taking 4 minutes, then your instance is being
overrun; it's unable to keep up with your rate of upload.  What's your
upload rate like?  How are you doing the upload?  Or is it that your
servers are buckling under the load?  Are they swapping?  Usually
compaction runs fast.  It'll take long if it's compacting many more
files than the threshold.  Grep your logs and see if compactions are
taking steadily longer.  Do you have a lot of blocking happening in your
logs (where the regionserver puts up a temporary block on updates
because it isn't able to flush fast enough)?  Are you on a recent hbase?
Have you altered the flush or maximum region file sizes?
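
One way to check whether compactions are getting steadily longer is to pull the durations out of the regionserver log. The following is only a rough sketch: it assumes each completion message sits on a single line shaped like the INFO line quoted below ("compaction completed on region ... in 4mins, 25sec"). The optional "hrs" part of the pattern and the function name compaction_durations are assumptions for illustration, not something confirmed in this thread.

```python
import re
from datetime import datetime

# Pattern modelled on the INFO line quoted in this thread. Only the
# "Nmins, Nsec" form is confirmed there; the optional "Nhrs, " prefix
# and a seconds-only form are assumptions.
COMPACTION_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"compaction completed on region (?P<region>.+?) in "
    r"(?:(?P<hrs>\d+)hrs, )?(?:(?P<mins>\d+)mins, )?(?P<secs>\d+)sec"
)


def compaction_durations(lines):
    """Yield (timestamp, region, seconds) for each matching log line."""
    for line in lines:
        m = COMPACTION_RE.search(line)
        if m is None:
            continue  # not a compaction-completed message
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        seconds = (
            int(m.group("hrs") or 0) * 3600
            + int(m.group("mins") or 0) * 60
            + int(m.group("secs"))
        )
        yield ts, m.group("region"), seconds
```

Feeding it an open log file (for example compaction_durations(open(path)), where path is your regionserver log) and printing the tuples in order should show at a glance whether the durations trend upward over the course of the upload.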


Dru Jensen wrote:
> Stack and J-D, Thanks for your responses.
> It looks like the RetriesExhaustedException occurred during:
> 2008-10-23 11:08:55,180 INFO 
> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on 
> region ... 1224785065371 in 4mins, 25sec
> It doesn't look like I am having the HBASE-921 issue (yet).
> What settings can I change to cause the compaction to not take so long?
> I found this setting:
> <property>
>     <name>hbase.hstore.compactionThreshold</name>
>     <value>3</value>
>     <description>
>     If more than this number of HStoreFiles in any one HStore
>     (one HStoreFile is written per flush of memcache) then a compaction
>     is run to rewrite all HStoreFiles files as one.  Larger numbers
>     put off compaction but when it runs, it takes longer to complete.
>     During a compaction, updates cannot be flushed to disk.  Long
>     compactions require memory sufficient to carry the logging of
>     all updates across the duration of the compaction.
>     If too large, clients timeout during compaction.
>     </description>
> </property>
> Should I lower this or is there a better way?
> Thanks,
> Dru
> On Oct 23, 2008, at 11:37 AM, Jean-Daniel Cryans wrote:
>> Dru.
>> See also if it's a case of HBASE-921
>> <https://issues.apache.org/jira/browse/HBASE-921> because it
>> would make sense if not using hbase 0.18.1 and under a heavy
>> load.
>> J-D
>> On Thu, Oct 23, 2008 at 2:30 PM, stack <stack@duboce.net> wrote:
>>> Find the MR task that failed.  Click through the UI to look at its logs.
>>> It may have interesting info.  It's probably complaining about a region
>>> not being available (NSRE).  Figure out which region it is.  Use the
>>> region historian or grep in the master logs -- 'grep -v metaScanner' so
>>> you avoid the metaScanner noise -- to see if you can figure out the
>>> region's history around the failure.  Look too at loading around failure
>>> time.  Were you swapping, etc.?  (Ganglia or some such helps here.)
>>> You might also test that the table is still wholesome -- that the MR job
>>> didn't damage the table.  A quick check that all regions are online and
>>> accessible is to scan for a column whose column family does exist but
>>> whose qualifier you know is not present: e.g. if you have column family
>>> 'page' and you know there is no column 'page:xyz', scan with that (enable
>>> DEBUG in log4j so you can see regions being loaded as the scan
>>> progresses): "scan 'TABLENAME', ['page:xyz']".
>>> You might need to up the timeouts/retries.
>>> St.Ack
>>> Dru Jensen wrote:
>>>> Hi hbase-users,
>>>> During a fairly large MR process, on the Reduce cycle as it's writing
>>>> its results to a table, I see
>>>> org.apache.hadoop.hbase.NotServingRegionException
>>>> in the region server log several times and then I see a split reporting
>>>> it was successful.
>>>> Eventually, the Reduce process fails with
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException after 10
>>>> failed attempts.
>>>> What can I do to fix it?
>>>> Thanks,
>>>> Dru
