hbase-user mailing list archives

From Luke Forehand <luke.foreh...@networkedinsights.com>
Subject Re: Hanging regionservers
Date Mon, 19 Jul 2010 05:20:47 GMT
Yes, the hang occurred with the timeout set to 0 on CDH2.

On 7/18/10 4:12 PM, "Stack" <stack@duboce.net> wrote:

This is a hang with timeout set to 0 but on CDH2?
St.Ack

On Sun, Jul 18, 2010 at 1:36 PM, Luke Forehand
<luke.forehand@networkedinsights.com> wrote:
> I experienced the hang on my second job attempt.  I will be pastebinning stacktraces
> and logs of all three servers tonight.  The datanode log of one of the servers is way bigger
> than the rest and that's all the analysis I've done so far.  Meeting with Cloudera on Monday
> and they'll probably want me to migrate to CDH3.  Need to mow the lawn...  I'll report back
> soon.
>
> -Luke
>
> On 7/16/10 6:34 PM, "Ryan Rawson" <ryanobjc@gmail.com> wrote:
>
> According to Todd, there is some kind of weird thread coordination
> issue which is worked around by setting the timeout to 0, even though
> we actually aren't hitting any timeouts in the failure case.
>
> And it might have been fixed in CDH3.  I haven't had a chance to run it
> yet so I can't say.
>
> -ryan
>
> On Fri, Jul 16, 2010 at 3:32 PM, Stack <stack@duboce.net> wrote:
>> So, it seems like you are bypassing the issue by having no timeout on
>> the socket.  I would for sure be interested, though, to hear whether you
>> still have the issue on cdh3b2.  Most folks will not be running with no
>> socket timeout.
>>
>> Thanks Luke.
>> St.Ack
>>
>>
>> On Fri, Jul 16, 2010 at 3:01 PM, Luke Forehand
>> <luke.forehand@networkedinsights.com> wrote:
>>> Using Ryan Rawson's suggested config tweaks, we have just completed a successful
>>> job run with a 15GB sequence file, no hang.  I'm setting up to have multiple files process
>>> this weekend with the new settings.  :-)  I believe the dfs socket write timeout being indefinite
>>> was the trick.
>>>
>>> I'll post my results on Monday.  Thanks for the support thus far!
>>>
>>> -Luke
>>>
>>> On 7/15/10 10:17 PM, "Ryan Rawson" <ryanobjc@gmail.com> wrote:
>>>
>>> I'm not seeing anything in that logfile. You are seeing compactions
>>> for various regions, but I'm not seeing flushes (typical during insert
>>> loads) and nothing else. One thing we look for is the log message
>>> "Blocking updates", which indicates that a particular region has
>>> decided to hold up updates to avoid taking too many inserts.
>>>
>>> Like I said, you could be seeing this on a different regionserver, if
>>> all the clients are blocked on 1 regionserver and can't get to the
>>> others then most will look idle and only one will actually show
>>> anything interesting in the log.
>>>
>>> Can you check for this behaviour?
>>>
>>> Also if you want to tweak the config with the values I pasted that should help.
>>>
>>> On Thu, Jul 15, 2010 at 7:25 PM, Luke Forehand
>>> <luke.forehand@networkedinsights.com> wrote:
>>>> It looks like we are going straight from the default config, no explicit setting
>>>> of anything.
>>>>
>>>> On 7/15/10 9:03 PM, "Ryan Rawson" <ryanobjc@gmail.com> wrote:
>>>>
>>>> In this case the regionserver isn't actually doing anything - all the
>>>> IPC thread handlers are waiting in their queue handoff thingy (how
>>>> they get socket/work to do).
>>>>
>>>> Something elsewhere perhaps?  Check the logs of your jobs, there might
>>>> be something interesting there.
>>>>
>>>> One thing that frequently happens is you overrun 1 regionserver with
>>>> edits and it isn't flushing fast enough, so it pauses updates and all
>>>> clients end up stuck on it.
>>>>
>>>> What was that config again?  I use these settings:
>>>>
>>>> <property>
>>>>  <name>hbase.hstore.blockingStoreFiles</name>
>>>>  <value>15</value>
>>>> </property>
>>>>
>>>> <property>
>>>>  <name>dfs.datanode.socket.write.timeout</name>
>>>>  <value>0</value>
>>>> </property>
>>>>
>>>> <property>
>>>>  <name>hbase.hregion.memstore.block.multiplier</name>
>>>>  <value>8</value>
>>>> </property>
>>>>
>>>> Perhaps try these?
>>>>
>>>> -ryan
>>>
>>
>
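
The thread turns on whether the three properties Ryan pasted are actually set, since Luke reports running with the default config. A minimal sketch of how one might check a Hadoop-style site file for them follows. The function names and the EXAMPLE file contents are hypothetical and only for illustration; the property names and values are the ones quoted in the thread.

```python
import xml.etree.ElementTree as ET

# Properties and values quoted by Ryan in the thread.
SUGGESTED = {
    "hbase.hstore.blockingStoreFiles": "15",
    "dfs.datanode.socket.write.timeout": "0",
    "hbase.hregion.memstore.block.multiplier": "8",
}


def read_site_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style site file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def report_overrides(xml_text, wanted=SUGGESTED):
    """Label each wanted property as set, differing, or left unset."""
    props = read_site_properties(xml_text)
    report = {}
    for name, value in wanted.items():
        if name not in props:
            report[name] = "unset (default applies)"
        elif props[name] == value:
            report[name] = "set"
        else:
            report[name] = "differs: " + props[name]
    return report


# Hypothetical site file contents, not taken from anyone's cluster.
EXAMPLE = """<configuration>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for name, status in report_overrides(EXAMPLE).items():
        print(name, "->", status)
```

In practice the XML text would be read from the site file on each regionserver; which file carries each property depends on the deployment, so that part is left out of the sketch.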
