chukwa-dev mailing list archives

From "Ari Rabkin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-487) Collector left in a bad state after temporary NN outage
Date Mon, 17 May 2010 22:54:45 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868432#action_12868432 ]

Ari Rabkin commented on CHUKWA-487:
-----------------------------------

My sense of this is that failure-recovery logic is hard to write and hard to test. The bug
that prompted this JIRA was itself a failure-handler bug. It's also not clear to me what the
benefit of collector-side retry would be. Chukwa already has mechanisms for agent-side retry,
via CHUKWA-369.

You can turn agent-side recovery on using the httpConnector.asyncAcks option.
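
A minimal sketch of what that might look like, assuming the option is a boolean property set in the agent's Hadoop-style configuration file (the file name conf/chukwa-agent-conf.xml and the value "true" are assumptions here, not something stated in this thread):

```xml
<!-- Sketch only: assumes httpConnector.asyncAcks is a boolean read from the
     agent's configuration file (e.g. conf/chukwa-agent-conf.xml). -->
<property>
  <name>httpConnector.asyncAcks</name>
  <value>true</value>
  <description>Turns on agent-side recovery (see CHUKWA-369).</description>
</property>
```

The agent would presumably need a restart to pick up the change.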

This should all be documented at some point.

> Collector left in a bad state after temporary NN outage
> -------------------------------------------------------
>
>                 Key: CHUKWA-487
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-487
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>            Priority: Blocker
>         Attachments: CHUKWA-487.patch, CHUKWA-487.threaddump.txt
>
>
> When the name node returns errors to the collector, at some point the collector dies
> halfway. This behavior should be changed either to resemble the agents and keep trying, or
> to shut down completely. Instead, what I'm seeing is that the collector logs that it's shutting
> down, and the var/pidDir/Collector.pid file gets removed, but the collector continues to run,
> albeit not handling new data. The following log entry is then repeated ad infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

