chukwa-dev mailing list archives

From "Jerome Boulon (JIRA)" <>
Subject [jira] Commented: (CHUKWA-487) Collector left in a bad state after temporary NN outage
Date Fri, 14 May 2010 20:16:43 GMT


Jerome Boulon commented on CHUKWA-487:

I took a quick look at the code; this is something I changed in Honu because of a possible
"virtual deadlock".
The main thread (the writer) tries to acquire the lock to write to the sequence file while the
shutdownHook also tries to get it.
The issue is that the writer holds the lock for about 2 minutes (NN retries), and chances are
that, statistically, the next one to get the lock will be another queue add rather than the close.
One quick workaround for now would be to wait no longer than xxx sec on the close, and also
to configure your NN client to fail fast instead of retrying again and again in the specific
case where the NN is not available.
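
To illustrate the "wait no longer than xxx sec on the close" idea, here is a minimal sketch.
It assumes the writer is guarded by a java.util.concurrent ReentrantLock; BoundedCloseWriter
and BoundedCloseDemo are hypothetical names, not Chukwa classes, and the 100 ms timeout is
only a placeholder for whatever value fits your shutdown budget.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a writer; names are illustrative, not Chukwa's API.
class BoundedCloseWriter {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile boolean closed = false;

    // Runs a write while holding the lock; the write may be slow (e.g. NN retries).
    void add(Runnable write) {
        lock.lock();
        try {
            if (!closed) {
                write.run();
            }
        } finally {
            lock.unlock();
        }
    }

    // Waits at most the given time for the lock. Returns true on a clean close,
    // false if the wait timed out or the calling thread was interrupted.
    boolean close(long timeout, TimeUnit unit) {
        try {
            if (!lock.tryLock(timeout, unit)) {
                return false; // caller decides what to do next, e.g. force the exit
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            closed = true;
            return true;
        } finally {
            lock.unlock();
        }
    }
}

class BoundedCloseDemo {
    // Returns {result of close during a slow write, result of close afterwards}.
    static boolean[] run() {
        BoundedCloseWriter writer = new BoundedCloseWriter();
        CountDownLatch holding = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread slow = new Thread(() -> writer.add(() -> {
            holding.countDown();
            try {
                release.await(); // stands in for the long NN retry loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        slow.start();
        try {
            holding.await(); // make sure the writer thread owns the lock
            boolean first = writer.close(100, TimeUnit.MILLISECONDS);  // lock is busy
            release.countDown();
            slow.join();
            boolean second = writer.close(100, TimeUnit.MILLISECONDS); // lock is free
            return new boolean[] {first, second};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("close while writing: " + r[0] + ", close after: " + r[1]);
    }
}
```

With a bounded wait, the shutdown hook gets a definite answer instead of blocking behind the
writer, and can then choose to give up on a clean close.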

A longer-term fix could be to back off in the SeqFileWriter.add method.
In Honu, I've put timeouts around all locks and addQueue to make sure that nobody gets
blocked on a lock. I also have a TRY_LATER error that is returned if the add takes more than
2 seconds, and a READY/!READY state to accept/reject incoming requests.
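
The back-off idea can be sketched roughly as follows. This is not Honu's or Chukwa's actual
code: AddResult, BackoffQueue, and the NOT_READY value are hypothetical names chosen for
illustration, and a bounded BlockingQueue is assumed as the hand-off between the servlet
and the writer. Only TRY_LATER, the 2-second limit, and the READY/!READY state come from
the description above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical result codes; only TRY_LATER is named in the comment above.
enum AddResult { OK, TRY_LATER, NOT_READY }

// Hypothetical bounded queue in front of a writer.
class BackoffQueue {
    private final BlockingQueue<String> queue;
    private final long addTimeoutMillis;
    private volatile boolean ready = true; // READY / !READY state

    BackoffQueue(int capacity, long addTimeoutMillis) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.addTimeoutMillis = addTimeoutMillis;
    }

    // Flip to false when the writer is known to be unhealthy or shutting down.
    void setReady(boolean ready) {
        this.ready = ready;
    }

    // Rejects immediately when not ready; otherwise waits a bounded time for space.
    AddResult add(String chunk) {
        if (!ready) {
            return AddResult.NOT_READY;
        }
        try {
            boolean accepted = queue.offer(chunk, addTimeoutMillis, TimeUnit.MILLISECONDS);
            return accepted ? AddResult.OK : AddResult.TRY_LATER;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return AddResult.TRY_LATER;
        }
    }

    // Called by the writer side to drain one item; returns null when empty.
    String poll() {
        return queue.poll();
    }
}

class BackoffQueueDemo {
    public static void main(String[] args) {
        // 2000 ms mirrors the 2-second limit mentioned above.
        BackoffQueue q = new BackoffQueue(1, 2000);
        System.out.println(q.add("chunk-1")); // queue has room
        q.setReady(false);
        System.out.println(q.add("chunk-2")); // rejected without waiting
    }
}
```

The caller never blocks for longer than the configured timeout, so a stalled writer shows up
as TRY_LATER responses upstream rather than as threads piled up on a lock.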

> Collector left in a bad state after temporary NN outage
> -------------------------------------------------------
>                 Key: CHUKWA-487
>                 URL:
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>            Priority: Blocker
>         Attachments: CHUKWA-487.threaddump.txt
> When the name node returns errors to the collector, at some point the collector dies
> halfway. This behavior should be changed to either resemble the agents and keep trying, or
> to shut down completely. Instead, what I'm seeing is that the collector logs that it's shutting
> down, and the var/pidDir/ file gets removed, but the collector continues to run,
> albeit not handling new data. Instead, this log entry is repeated ad infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

