hadoop-common-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5796) DFS Write pipeline does not detect defective datanode correctly in some cases (HADOOP-3339)
Date Tue, 24 Nov 2009 23:21:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782231#action_12782231 ]

Todd Lipcon commented on HADOOP-5796:

I'm not certain that what I'm seeing has exactly the same cause, but I have another reproducible
case in which write pipeline recovery decides the first node is dead every time, when in fact
it's the last node that's dead. In my case, I've set up a 3-node HDFS cluster with replication
3, with each DN having one 100G volume and one 2G volume. The 2G volumes fill up and throw
DiskOutOfSpaceExceptions, and the write pipeline recovers incorrectly when the node that runs
out of space is the last one. Recovery first ejects pipeline[0], fails again when trying to
continue the write on the dead node, ejects the second node, and then retries writing only to
the failed node. Of course that fails too, and the whole write is aborted.
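
To make the sequence above concrete, here is a minimal standalone sketch of the two
ejection behaviors. It is a toy model, not the DFSClient code: the class
PipelineRecoverySketch, the EjectPolicy interface, and the node names DN1..DN3 are all
hypothetical names invented for illustration, and it assumes a write fails whenever the
bad node is still in the pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineRecoverySketch {

    /** Hypothetical policy: picks the pipeline index to eject after a failed write. */
    public interface EjectPolicy {
        int choose(List<String> pipeline, String badNode);
    }

    /** Models the behavior described above: always blame pipeline[0]. */
    public static final EjectPolicy BLAME_FIRST = (pipeline, badNode) -> 0;

    /** Models the desired behavior: blame the node that actually failed. */
    public static final EjectPolicy BLAME_FAILED =
            (pipeline, badNode) -> pipeline.indexOf(badNode);

    /**
     * Repeats "write fails, eject one node, retry" until the bad node is gone.
     * Returns the surviving pipeline; an empty list means the write was aborted.
     */
    public static List<String> recover(List<String> nodes, String badNode, EjectPolicy policy) {
        List<String> pipeline = new ArrayList<>(nodes);
        // Toy assumption: the write fails as long as the bad node is still present.
        while (pipeline.contains(badNode)) {
            int ejected = policy.choose(pipeline, badNode);
            pipeline.remove(ejected); // remove(int index), not remove(Object)
        }
        return pipeline;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("DN1", "DN2", "DN3");
        // The last node is the one that ran out of space.
        System.out.println(recover(nodes, "DN3", BLAME_FIRST));  // prints []
        System.out.println(recover(nodes, "DN3", BLAME_FAILED)); // prints [DN1, DN2]
    }
}
```

With the always-blame-first policy the two healthy nodes are ejected before the bad one,
leaving nothing to write to; blaming the failed node keeps the write going on DN1 and DN2.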

I'll try applying this patch (and think it through a bit further) to see whether it resolves
the issue.

> DFS Write pipeline does not detect defective datanode correctly in some cases (HADOOP-3339)
> -------------------------------------------------------------------------------------------
>                 Key: HADOOP-5796
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5796
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Raghu Angadi
>             Fix For: 0.20.2
>         Attachments: toreproduce-5796.patch
> HDFS write pipeline does not select the correct datanode in some error cases. One example:
> say DN2 is the second datanode and a write to it times out because it is in a bad state. The
> pipeline actually removes the first datanode. If such a datanode happens to be the last one
> in the pipeline, the write is aborted completely with a hard error.
> Essentially, the error occurs when a write to a downstream datanode fails, rather than a
> read. This bug was actually fixed in 0.18 (HADOOP-3339), but HADOOP-1700 essentially
> reverted it. I am not sure why.
> It is absolutely essential for HDFS to handle failures on a subset of datanodes in a pipeline.
> At the very least, we should not have known bugs that lead to hard failures.
> I will attach a patch for a hack that illustrates this problem. I am still thinking about
> what an automated test would look like for this one.
> My preferred target for this fix is 0.20.1.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
