hbase-user mailing list archives

From "Slava Gorelik" <slava.gore...@gmail.com>
Subject Re: Regionserver fails to serve region
Date Tue, 04 Nov 2008 08:26:37 GMT
Hi Michael. After reformatting HDFS, HBase started to work like a Swiss clock.
It worked with 8 clients under intensive load for about 30 hours.

Just a small question: after about 28 hours (when I came back to work) I found
that one of the 7 datanodes in Hadoop is at about 98% usage while all the others
are at about 30%. Is that normal?
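
For reference, this is roughly what I would run to check per-node usage and, if
needed, rebalance (just a sketch; I'm assuming the balancer in our Hadoop version
accepts a -threshold argument):

  ./bin/hadoop dfsadmin -report
  ./bin/hadoop balancer -threshold 10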

Best Regards.



On Fri, Oct 31, 2008 at 10:16 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:

> Hi. No problem with the silly question :-) Yes, sure, I replaced it. Here is the
> list of folders that begin with 73*:
>
> drwxr-xr-x   - XXXXXXXXX supergroup          0 2008-10-29 11:13 /hbase/BizDB/732078971/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13 /hbase/BizDB/732215319/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13 /hbase/BizDB/733411255/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:14 /hbase/BizDB/733598097/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 10:50 /hbase/BizDB/734145833/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:09 /hbase/BizDB/735612900/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:15 /hbase/BizDB/738009120/BusinessObject
>
> There is no 735893330 folder.
> Scanning .META. in the shell is not easy at all. .META. is huge, and a simple scan
> without specifying a column gives me about 10 minutes of just listing the
> .META. content, so I failed to find 735893330. Maybe you can give me
> the name of the column where this info is placed?
>
> I think I'll reformat HDFS and start from a clean environment, and then we'll see.
> I'll do it this Sunday and let you know.
>
> Best Regards and Big Thank You for your patience and assistance.
>
>
> On Fri, Oct 31, 2008 at 4:47 AM, Michael Stack <stack@duboce.net> wrote:
>
>> Slava Gorelik wrote:
>>
>>> Hi. I also noticed this exception.
>>> Strange that this exception happens every time on the same
>>> regionserver.
>>> I tried to find the directory hdfs://X:9000/hbase/BizDB/735893330 - it does not exist.
>>>  Very strange, but the history folder in Hadoop is empty.
>>>
>>>
>> It is odd indeed that the system keeps trying to load a region that does
>> not exist.
>>
>> I don't think it's necessarily the same regionserver that is responsible.
>>  I'd think it's an attribute of the region that we're trying to deploy on that
>> server.
>>
>> Silly question: you did replace 'X' with your machine name in the above?
>>
>> If you restart, it still tries to load this nonexistent region?
>>
>> If so, the .META. table is not consistent with what's on the filesystem.
>>  They've gotten out of sync.  Describing how to repair it is involved.
>>
>>> Will reformatting HDFS help?
>>>
>>>
>>>
>> Do a "scan '.META.'" in the shell.  Do you see your region listed (look at
>> the encoded name attribute to find 735893330)?
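>>
>> For example, something like this (a sketch; the exact shell syntax depends on
>> your HBase version, and info:regioninfo is the column I'd expect to carry the
>> region's encoded name):
>>
>>   scan '.META.', {COLUMNS => 'info:regioninfo'}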
>>
>> If your table is damaged -- I'd guess it is because ulimit was bad up to this
>> point -- the best thing might be to start over.
>>
>>> One more thing at the last minute: I found that one node in the cluster has a
>>> totally different time. Could this be the cause of such problems?
>>>
>>>
>> We thought we'd fixed all problems that could arise from time skew, but
>> you never know.  In our requirements, clocks must be synced.  Fix this too
>> if you can before reloading.
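>>
>> A sketch of getting the clocks back in sync (assuming ntpdate/ntpd is available
>> on the nodes): run this once on the skewed node, then keep ntpd running so it
>> stays synced:
>>
>>   ntpdate pool.ntp.org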
>>
>>> P.S. About the logs, is it possible to send them to some email? Each log file
>>> compressed is about 1 MB, and only in 3 files did I find exceptions.
>>>
>>>
>>>
>> There probably is such functionality but I'm not familiar with it.  Can you put
>> them under a webserver at your place so I can grab them?  You can send me
>> the URL off-list if you like.
>>
>> Thanks for your patience Slava.  We'll figure it out.
>>
>> St.Ack
>>
>>
>>> On Thu, Oct 30, 2008 at 10:25 PM, stack <stack@duboce.net> wrote:
>>>
>>>
>>>
>>>> Can you put them someplace that I can pull them?
>>>>
>>>> I took another look at your logs.  I see that a region is missing files.
>>>>  That means it will never open and just keep trying.  Grep your logs for
>>>> FileNotFound.  You'll see this:
>>>>
>>>>
>>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: File does not exist: hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
>>>>
>>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: File does not exist: hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data
>>>>
>>>> Try shutting down, and removing these files.   Remove the following
>>>> directories:
>>>>
>>>>
>>>>   hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
>>>>   hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
>>>>   hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
>>>>   hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
>>>>
>>>> Then retry restarting.
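>>>>
>>>> Roughly (a sketch; the grep runs against your HBase logs directory, and the
>>>> plain paths assume your fs.default.name points at the same namenode as in
>>>> the log messages):
>>>>
>>>>   grep FileNotFound hbase-*-regionserver-*.log
>>>>   ./bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
>>>>   ./bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/info/647541142630058906
>>>>   ./bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
>>>>   ./bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/info/2243545870343537637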
>>>>
>>>> You can try to figure out how these files got lost by going back in your
>>>> history.
>>>>
>>>>
>>>> St.Ack
>>>>
>>>>
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>
>>>>
>>>>> Michael, I still have the problem, but the log files are very big (50 MB each);
>>>>> even compressed they are bigger than the limit for this mailing list.
>>>>> Most of the problems happened during compaction (I see it in the log);
>>>>> maybe I can send some parts of the logs?
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>>>>
>>>>>> Sorry, my mistake, I did it for the wrong user name. Thanks, updating now;
>>>>>> I'll try again soon.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>>>>>
>>>>>>> Hi. Very strange, I see in limits.conf that it's upped.
>>>>>>> I attached the limits.conf; please have a look, maybe I did it wrong.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <stack@duboce.net> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Thanks for the logs Slava.  I notice that you have not upped the ulimit
>>>>>>>> on your cluster.  See the head of your logs where we print out the ulimit.
>>>>>>>> It's 1024.  This could be one cause of your grief, especially when you
>>>>>>>> seemingly have many regions (>1000).  Please try upping it.
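>>>>>>>>
>>>>>>>> A sketch of what upping it usually looks like (assuming the daemons run
>>>>>>>> as user 'hadoop'; verify with 'ulimit -n' as that user after re-login):
>>>>>>>>
>>>>>>>>   # /etc/security/limits.conf
>>>>>>>>   hadoop  soft  nofile  32768
>>>>>>>>   hadoop  hard  nofile  32768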
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Slava Gorelik wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi.
>>>>>>>>> I enabled the DEBUG log level and now I'm sending all logs (archived),
>>>>>>>>> including the fsck run result.
>>>>>>>>> Today my program started to fail a couple of minutes from the beginning;
>>>>>>>>> it's very easy to reproduce the problem, and the cluster became very unstable.
>>>>>>>>>
>>>>>>>>> Best Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net> wrote:
>>>>>>>>>
>>>>>>>>>  See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>>>>>>
>>>>>>>>>  St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>      Hi. First of all I want to say thank you for your assistance!!!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      DEBUG on hadoop or hbase? And how can I enable it?
>>>>>>>>>      fsck said that HDFS is healthy.
>>>>>>>>>
>>>>>>>>>      Best Regards and Thank You
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>              Hi. HDFS capacity is about 800 GB (8 datanodes) and the
>>>>>>>>>              current usage is about 30 GB. This is after a total re-format
>>>>>>>>>              of HDFS that was made an hour before.
>>>>>>>>>
>>>>>>>>>              BTW, the logs I sent are from the first exception that
>>>>>>>>>              I found in them.
>>>>>>>>>              Best Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Please enable DEBUG and retry.  Send me all logs.  What does
>>>>>>>>>          the fsck on HDFS say?  There is something seriously wrong with
>>>>>>>>>          your cluster if you are having so much trouble getting it
>>>>>>>>>          running.  Let's try and figure it out.
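>>>>>>>>>
>>>>>>>>>          A sketch of turning DEBUG on (assuming the stock log4j setup:
>>>>>>>>>          edit conf/log4j.properties in the hbase install and restart):
>>>>>>>>>
>>>>>>>>>            log4j.logger.org.apache.hadoop.hbase=DEBUG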
>>>>>>>>>
>>>>>>>>>          St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>              On Tue, Oct 28, 2008 at 7:12 PM, stack <stack@duboce.net> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  I took a quick look Slava (thanks for sending the
>>>>>>>>>                  files).  Here are a few notes:
>>>>>>>>>
>>>>>>>>>                  + The logs are from after the damage is done; the
>>>>>>>>>                  transition from good to bad is missing.  If I could
>>>>>>>>>                  see that, it would help.
>>>>>>>>>                  + But what seems plain is that your HDFS is very
>>>>>>>>>                  sick.  See this from the head of one of the
>>>>>>>>>                  regionserver logs:
>>>>>>>>>
>>>>>>>>>                  2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>>>>>>>>>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>>>>>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>>>>>                      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>>>>>
>>>>>>>>>                  2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>>>>>>>>>                  2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>>>>>                  java.io.IOException: Could not get block locations. Aborting...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  If HDFS is ailing, hbase is too.  In fact, the
>>>>>>>>>                  regionservers will shut themselves down to protect
>>>>>>>>>                  against damaging or losing data:
>>>>>>>>>
>>>>>>>>>                  2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart
>>>>>>>>>
>>>>>>>>>                  So, what's up with your HDFS?  Not enough space
>>>>>>>>>                  allotted?  What happens if you run "./bin/hadoop fsck /"?
>>>>>>>>>                  Does that give you a clue as to what happened?  Dig into
>>>>>>>>>                  the datanode and namenode logs.  Look for where the
>>>>>>>>>                  exceptions start.  It might give you a clue.
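>>>>>>>>>
>>>>>>>>>                  For more detail (a sketch; these fsck flags should be
>>>>>>>>>                  available in your Hadoop version):
>>>>>>>>>
>>>>>>>>>                    ./bin/hadoop fsck / -files -blocks -locations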
>>>>>>>>>
>>>>>>>>>                  + The suse regionserver log had garbage in it.
>>>>>>>>>
>>>>>>>>>                  St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                      Hi.
>>>>>>>>>                      My happiness was very short :-(  After I
>>>>>>>>>                      successfully added 1M rows (50k each row) I tried
>>>>>>>>>                      to add 10M rows. After 3-4 working hours it started
>>>>>>>>>                      dying: first one regionserver died, then another,
>>>>>>>>>                      and eventually the whole cluster was dead.
>>>>>>>>>
>>>>>>>>>                      I attached log files (the relevant parts, archived)
>>>>>>>>>                      from the regionservers and from the master.
>>>>>>>>>
>>>>>>>>>                      Best Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                      On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>                       Hi.
>>>>>>>>>                       So far so good: after changing the file descriptors
>>>>>>>>>                       and dfs.datanode.socket.write.timeout,
>>>>>>>>>                       dfs.datanode.max.xcievers, my cluster works stably.
>>>>>>>>>                       Thank you and best regards.
>>>>>>>>>
>>>>>>>>>                       P.S. Regarding the missing delete-multiple-columns
>>>>>>>>>                       functionality, I filed a JIRA:
>>>>>>>>>                       https://issues.apache.org/jira/browse/HBASE-961
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                       On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <stack@duboce.net> wrote:
>>>>>>>>>
>>>>>>>>>                           Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>                               Hi. Haven't tried them yet; I'll try
>>>>>>>>>                               tomorrow morning. In general the cluster
>>>>>>>>>                               is working well; the problems begin when
>>>>>>>>>                               I try to add 10M rows, after about 1.2M
>>>>>>>>>                               it happened.
>>>>>>>>>
>>>>>>>>>                           Anything else running beside the regionserver
>>>>>>>>>                           or datanodes that would suck resources?  When
>>>>>>>>>                           datanodes begin to slow, we begin to see the
>>>>>>>>>                           issue Jean-Adrien's configurations address.
>>>>>>>>>                           Are you uploading using MapReduce?  Are TTs
>>>>>>>>>                           running on the same nodes as the datanode and
>>>>>>>>>                           regionserver?  How are you doing the upload?
>>>>>>>>>                           Describe what your uploader looks like (sorry
>>>>>>>>>                           if you've already done this).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                I already changed the limit of file
>>>>>>>>>                                descriptors.
>>>>>>>>>
>>>>>>>>>                           Good.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                               I'll try to change the properties:
>>>>>>>>>
>>>>>>>>>                               <property>
>>>>>>>>>                                 <name>dfs.datanode.socket.write.timeout</name>
>>>>>>>>>                                 <value>0</value>
>>>>>>>>>                               </property>
>>>>>>>>>
>>>>>>>>>                               <property>
>>>>>>>>>                                 <name>dfs.datanode.max.xcievers</name>
>>>>>>>>>                                 <value>1023</value>
>>>>>>>>>                               </property>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                           Yeah, try it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                               And I'll let you know. Are there any other
>>>>>>>>>                               prescriptions? Did I miss something?
>>>>>>>>>
>>>>>>>>>                               BTW, off topic, but I sent an e-mail to the
>>>>>>>>>                               list recently and I can't see it:
>>>>>>>>>                               Is it possible to delete multiple columns
>>>>>>>>>                               in any way by regex, for example
>>>>>>>>>                               column_name_* ?
>>>>>>>>>
>>>>>>>>>                           Not that I know of.  If it's not in the API,
>>>>>>>>>                           it should be.  Mind filing a JIRA?
>>>>>>>>>
>>>>>>>>>                           Thanks Slava.
>>>>>>>>>                           St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
