hbase-user mailing list archives

From "Slava Gorelik" <slava.gore...@gmail.com>
Subject Re: Regionserver fails to serve region
Date Fri, 31 Oct 2008 20:16:36 GMT
Hi. No problem with the silly question :-) Yes, sure, I replaced it. Here is the list of
folders that begin with 73*:

drwxr-xr-x   - XXXXXXXXX supergroup          0 2008-10-29 11:13
/hbase/BizDB/732078971/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13
/hbase/BizDB/732215319/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13
/hbase/BizDB/733411255/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:14
/hbase/BizDB/733598097/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 10:50
/hbase/BizDB/734145833/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:09
/hbase/BizDB/735612900/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:15
/hbase/BizDB/738009120/BusinessObject

There is no 735893330 folder.
Scanning .META. in the shell is not easy at all. .META. is huge, and a simple scan
without specifying a column takes about 10 minutes just to list the .META. content,
so I failed to find 735893330. Maybe you can give me the name of the column where
this info is kept?
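
A hedged aside on avoiding reading the whole scan by eye: dump .META. once to a
file and grep it for the encoded name. This assumes the stock bin/hbase shell, and
that the region descriptor is kept in the info:regioninfo column of .META.:

    # dump .META. to a file, then search it for the encoded region name
    echo "scan '.META.'" | $HBASE_HOME/bin/hbase shell > meta.txt
    grep -n 735893330 meta.txt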

I think I'll reformat HDFS and start from a clean environment, and then we'll see.
I'll do it this Sunday and let you know.

Best Regards and Big Thank You for your patience and assistance.


On Fri, Oct 31, 2008 at 4:47 AM, Michael Stack <stack@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi. I also noticed this exception.
>> Strange that this exception happens every time on the same
>> regionserver.
>> I tried to find the directory hdfs://X:9000/hbase/BizDB/735893330 - it does not exist.
>>  Very strange, but the history folder in hadoop is empty.
>>
>>
> It is odd indeed that the system keeps trying to load a region that does
> not exist.
>
> I don't think it's necessarily the same regionserver that is responsible.
>  I'd think it's an attribute of the region that we're trying to deploy on that
> server.
>
> Silly question: you did replace 'X' with your machine name in the above?
>
> If you restart, it still tries to load this nonexistent region?
>
> If so, the .META. table is not consistent with what's on the filesystem.
>  They've gotten out of sync.  Describing how to repair it is involved.
>
>> Reformatting HDFS will help?
>>
>>
>>
> Do a "scan '.META.'" in the shell.  Do you see your region listed (look at
> the encoded names attribute to find 735893330.
>
> If your table is damaged -- I'd guess it's because ulimit was bad up to this
> point -- the best thing might be to start over.
>
>> One more thing at the last minute: I found that one node in the cluster has
>> a totally different time. Could this cause such problems?
>>
>>
> We thought we'd fixed all problems that could arise from time skew, but you
> never know.  In our requirements, clocks must be synced.  Fix this too if
> you can before reloading.
>
>> P.S. About the logs, is it possible to send them to some email? Each log file
>> compressed is about 1MB, and I found exceptions in only 3 files.
>>
>>
>>
> There probably is such functionality but I'm not familiar with it.  Can you put
> them under a webserver at your place so I can grab them?  You can send me
> the URL offlist if you like.
>
> Thanks for your patience Slava.  We'll figure it out.
>
> St.Ack
>
>
>  On Thu, Oct 30, 2008 at 10:25 PM, stack <stack@duboce.net> wrote:
>>
>>
>>
>>> Can you put them someplace that I can pull them?
>>>
>>> I took another look at your logs.  I see that a region is missing files.
>>>  That means it will never open and just keep trying.  Grep your logs for
>>> FileNotFound.  You'll see this:
>>>
>>>
>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
>>> File does not exist:
>>>
>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
>>>
>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
>>> File does not exist:
>>>
>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data
>>>
>>> Try shutting down, and removing these files.   Remove the following
>>> directories:
>>>
>>>
>>>
>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
>>>
>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
>>>
>>>
>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
>>>
>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
>>>
>>> Then retry restarting.
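
For reference, a hedged sketch of those removals as Hadoop fs shell commands
(paths copied from above; run only with HBase shut down, and double-check each
path before deleting):

    $HADOOP_HOME/bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
    $HADOOP_HOME/bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/info/647541142630058906
    $HADOOP_HOME/bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
    $HADOOP_HOME/bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/info/2243545870343537637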
>>>
>>> You can try to figure out how these files got lost by going back through your
>>> history.
>>>
>>>
>>> St.Ack
>>>
>>>
>>>
>>> Slava Gorelik wrote:
>>>
>>>
>>>
>>>> Michael, I still have the problem, but the log files are very big (50MB
>>>> each);
>>>> even compressed they are bigger than the limit for this mailing list.
>>>> Most of the problems happen during compaction (I see it in the log);
>>>> maybe
>>>> I can send some parts of the logs?
>>>>
>>>> Best Regards.
>>>>
>>>> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>>>>
>>>>> Sorry, my mistake, I did it for the wrong user name. Thanks, updating now;
>>>>> I will try again soon.
>>>>>
>>>>>
>>>>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>>>>>>
>>>>>> Hi. Very strange, I see in limits.conf that it's upped.
>>>>>> I attached the limits.conf; please have a look, maybe I did it
>>>>>> wrong.
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks for the logs Slava.  I notice that you have not upped the ulimit
>>>>>>> on your cluster.  See the head of your logs where we print out the ulimit.
>>>>>>> It's 1024.  This could be one cause of your grief, especially when you
>>>>>>> seemingly have many regions (>1000).  Please try upping it.
>>>>>>> St.Ack
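
For reference, a hedged sketch of what "upping it" usually involves on each node;
the user name below is an assumption, so substitute whichever account runs the
datanode and regionserver, then log in again and verify with "ulimit -n":

    # /etc/security/limits.conf on every node in the cluster
    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768

On some distributions pam_limits must also be enabled (a "session required
pam_limits.so" line in the relevant /etc/pam.d file) before the new limit takes
effect.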
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Slava Gorelik wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi.
>>>>>>>> I enabled DEBUG log level and now I'm sending all logs (archived)
>>>>>>>> including fsck run result.
>>>>>>>> Today my program started to fail a couple of minutes in; it's
>>>>>>>> very easy to reproduce the problem, and the cluster became very unstable.
>>>>>>>>
>>>>>>>> Best Regards.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net> wrote:
>>>>>>>>
>>>>>>>>  See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>>>>>
>>>>>>>>  St.Ack
>>>>>>>>
>>>>>>>>
>>>>>>>>  Slava Gorelik wrote:
>>>>>>>>
>>>>>>>>      Hi. First of all I want to say thank you for your assistance!!!
>>>>>>>>
>>>>>>>>
>>>>>>>>      DEBUG on hadoop or hbase? And how can I enable it?
>>>>>>>>      fsck said that HDFS is healthy.
>>>>>>>>
>>>>>>>>      Best Regards and Thank You
>>>>>>>>
>>>>>>>>
>>>>>>>>      On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>          Slava Gorelik wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>              Hi. HDFS capacity is about 800GB (8 datanodes) and the
>>>>>>>>              current usage is about 30GB. This is after a total
>>>>>>>>              re-format of HDFS that was made an hour before.
>>>>>>>>
>>>>>>>>              BTW, the logs I sent are from the first exception that
>>>>>>>>              I found in them.
>>>>>>>>              Best Regards.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>          Please enable DEBUG and retry.  Send me all logs.  What
>>>>>>>>          does the fsck on HDFS say?  There is something seriously
>>>>>>>>          wrong with your cluster if you are having so much trouble
>>>>>>>>          getting it running.  Let's try and figure it out.
>>>>>>>>
>>>>>>>>          St.Ack
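
For reference, a hedged sketch of one common way to enable DEBUG, assuming the
stock conf/log4j.properties shipped with HBase (add the line on each node and
restart the daemons):

    # conf/log4j.properties
    log4j.logger.org.apache.hadoop.hbase=DEBUG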
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>              On Tue, Oct 28, 2008 at 7:12 PM, stack <stack@duboce.net> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  I took a quick look Slava (Thanks for sending the
>>>>>>>>                  files).   Here's a few notes:
>>>>>>>>
>>>>>>>>                  + The logs are from after the damage is done; the
>>>>>>>>                  transition from good to bad is missing.  If I could
>>>>>>>>                  see that, that would help.
>>>>>>>>                  + But what seems to be plain is that your HDFS is
>>>>>>>>                  very sick.  See this from the head of one of the
>>>>>>>>                  regionserver logs:
>>>>>>>>
>>>>>>>>                  2008-10-27 23:41:12,682 WARN
>>>>>>>>                  org.apache.hadoop.dfs.DFSClient: DataStreamer
>>>>>>>>                  Exception: java.io.IOException: Unable to create new block.
>>>>>>>>                   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>>>>                   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>>>>                   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>>>>
>>>>>>>>                  2008-10-27 23:41:12,682 WARN
>>>>>>>>                  org.apache.hadoop.dfs.DFSClient: Error Recovery for
>>>>>>>>                  block blk_-5188192041705782716_60000 bad datanode[0]
>>>>>>>>                  2008-10-27 23:41:12,685 ERROR
>>>>>>>>                  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>>>>>                  Compaction/Split failed for region
>>>>>>>>                  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>>>>                  java.io.IOException: Could not get block
>>>>>>>>                  locations. Aborting...
>>>>>>>>
>>>>>>>>                  If HDFS is ailing, hbase is too.  In fact, the
>>>>>>>>                  regionservers will shut themselves down to protect
>>>>>>>>                  themselves against damaging or losing data:
>>>>>>>>
>>>>>>>>                  2008-10-27 23:41:12,688 FATAL
>>>>>>>>                  org.apache.hadoop.hbase.regionserver.Flusher:
>>>>>>>>                  Replay of hlog required. Forcing server restart
>>>>>>>>
>>>>>>>>                  So, what's up with your HDFS?  Not enough space
>>>>>>>>                  allotted?  What happens if you run "./bin/hadoop
>>>>>>>>                  fsck /"?  Does that give you a clue as to what
>>>>>>>>                  happened?  Dig in the datanode and namenode logs.
>>>>>>>>                  Look for where the exceptions start.  It might give
>>>>>>>>                  you a clue.
>>>>>>>>
>>>>>>>>                  + The suse regionserver log had garbage in it.
>>>>>>>>
>>>>>>>>                  St.Ack
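
A hedged sketch of a more verbose fsck run than the bare "./bin/hadoop fsck /";
the extra flags list files, blocks and their locations, which can show exactly
which blocks went missing:

    $HADOOP_HOME/bin/hadoop fsck / -files -blocks -locations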
>>>>>>>>
>>>>>>>>
>>>>>>>>                  Slava Gorelik wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                      Hi.
>>>>>>>>                      My happiness was very short :-( After I
>>>>>>>>                      successfully added 1M rows (50k each row) I
>>>>>>>>                      tried to add 10M rows.
>>>>>>>>                      And after 3-4 working hours it started dying.
>>>>>>>>                      First one region server died, then another one,
>>>>>>>>                      and eventually the whole cluster was dead.
>>>>>>>>
>>>>>>>>                      I attached log files (relevant part, archived)
>>>>>>>>                      from the region servers and from the master.
>>>>>>>>
>>>>>>>>                      Best Regards.
>>>>>>>>
>>>>>>>>                      On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik
>>>>>>>>                      <slava.gorelik@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                       Hi.
>>>>>>>>                       So far so good; after changing the file
>>>>>>>>                       descriptors and dfs.datanode.socket.write.timeout,
>>>>>>>>                       dfs.datanode.max.xcievers
>>>>>>>>                       my cluster works stably.
>>>>>>>>                       Thank You and Best Regards.
>>>>>>>>
>>>>>>>>                       P.S. Regarding the missing functionality for
>>>>>>>>                       deleting multiple columns, I filed a JIRA:
>>>>>>>>                       https://issues.apache.org/jira/browse/HBASE-961
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                       On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack
>>>>>>>>                       <stack@duboce.net> wrote:
>>>>>>>>
>>>>>>>>                           Slava Gorelik wrote:
>>>>>>>>
>>>>>>>>                               Hi. I haven't tried them yet; I'll try
>>>>>>>>                               tomorrow morning. In general the
>>>>>>>>                               cluster is working well; the problems
>>>>>>>>                               begin when I try to add 10M rows, and
>>>>>>>>                               it happened after 1.2M.
>>>>>>>>
>>>>>>>>                           Anything else running beside the
>>>>>>>>                           regionserver or datanodes that would suck
>>>>>>>>                           resources?  When datanodes begin to slow,
>>>>>>>>                           we begin to see the issue Jean-Adrien's
>>>>>>>>                           configurations address.  Are you uploading
>>>>>>>>                           using MapReduce?  Are TTs running on same
>>>>>>>>                           nodes as the datanode and regionserver?
>>>>>>>>                           How are you doing the upload?  Describe
>>>>>>>>                           what your uploader looks like (Sorry if
>>>>>>>>                           you've already done this).
>>>>>>>>
>>>>>>>>
>>>>>>>>                                I already changed the limit of file
>>>>>>>>                                descriptors,
>>>>>>>>
>>>>>>>>                           Good.
>>>>>>>>
>>>>>>>>
>>>>>>>>                                I'll try to change the properties:
>>>>>>>>
>>>>>>>>                                <property>
>>>>>>>>                                  <name>dfs.datanode.socket.write.timeout</name>
>>>>>>>>                                  <value>0</value>
>>>>>>>>                                </property>
>>>>>>>>
>>>>>>>>                                <property>
>>>>>>>>                                  <name>dfs.datanode.max.xcievers</name>
>>>>>>>>                                  <value>1023</value>
>>>>>>>>                                </property>
>>>>>>>>
>>>>>>>>
>>>>>>>>                           Yeah, try it.
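
A hedged note on placement: in a 0.18-era setup such overrides usually go into
conf/hadoop-site.xml, followed by a restart of the DFS daemons; the exact file and
scripts below are assumptions about this cluster's layout:

    # add the two <property> blocks above, then bounce DFS
    vi $HADOOP_HOME/conf/hadoop-site.xml
    $HADOOP_HOME/bin/stop-dfs.sh && $HADOOP_HOME/bin/start-dfs.sh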
>>>>>>>>
>>>>>>>>
>>>>>>>>                               And I'll let you know. Are there any
>>>>>>>>                               other prescriptions? Did I miss
>>>>>>>>                               something?
>>>>>>>>
>>>>>>>>                               BTW, off topic, but I sent an e-mail
>>>>>>>>                               to the list recently and I can't see it:
>>>>>>>>                               Is it possible to delete multiple
>>>>>>>>                               columns in any way by regex, for
>>>>>>>>                               example column_name_* ?
>>>>>>>>
>>>>>>>>                           Not that I know of.  If it's not in the
>>>>>>>>                           API, it should be.  Mind filing a JIRA?
>>>>>>>>
>>>>>>>>                           Thanks Slava.
>>>>>>>>                           St.Ack
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
