hadoop-common-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-10131) NetWorkTopology#countNumOfAvailableNodes() is returning wrong value if excluded nodes passed are not part of the cluster tree
Date Sat, 20 Sep 2014 08:34:34 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinayakumar B updated HADOOP-10131:
-----------------------------------
    Description: 
I got the error "File /hdfs_COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation." in the following case:

1. A cluster with 2 DNs.
2. One of the datanodes had not been responding for the last 10 minutes and was about to be detected as dead at the NN.
3. I tried to write one file; the NN allocated both DNs for the block.
4. While creating the pipeline, the client took some time to detect the failure of one node.
5. Before the client detected the pipeline failure, the dead node was removed from the cluster map on the NN side.
6. The client then abandoned the previous block and asked for a new block with the dead node in the excluded list, and got the above exception even though one more live node was available.

When I dug into this further, I found that {{NetWorkTopology#countNumOfAvailableNodes()}} does not return the correct count when the excludeNodes passed from the client are not part of the cluster map.


One more case where the count is wrong:
1. There is no node present for the normalized scope in the cluster.
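
The two miscounts described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Hadoop's actual NetworkTopology code: the class AvailableNodeCountSketch, the methods naiveCount and guardedCount, the representation of the cluster map as a set of leaf paths, and the paths /rack1/dn1, /rack1/dn2 and /rack9 are all hypothetical and chosen for illustration. The "naive" variant shows one plausible way such a miscount can arise; the "guarded" variant subtracts an excluded node only if it is still in the cluster map and reports 0 for a scope that holds no nodes.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a cluster map (hypothetical; not Hadoop's NetworkTopology). */
public class AvailableNodeCountSketch {

    /** True if the leaf path lies under the scope, e.g. "/rack1/dn1" under "/rack1". */
    public static boolean inScope(String leafPath, String scope) {
        return (leafPath + "/").startsWith(scope + "/");
    }

    /** Number of leaves of the cluster map that lie under the scope. */
    public static int leavesInScope(Set<String> clusterLeaves, String scope) {
        int n = 0;
        for (String leaf : clusterLeaves) {
            if (inScope(leaf, scope)) {
                n++;
            }
        }
        return n;
    }

    /** Naive count: subtracts every excluded path under the scope, even stale ones. */
    public static int naiveCount(Set<String> clusterLeaves, String scope,
                                 Collection<String> excluded) {
        int excludedInScope = 0;
        for (String path : excluded) {
            if (inScope(path, scope)) {
                excludedInScope++;
            }
        }
        // Assumption for illustration: the scope is presumed to name at least
        // one node, so an unknown scope is still counted as 1.
        int scopeCount = Math.max(1, leavesInScope(clusterLeaves, scope));
        return scopeCount - excludedInScope;
    }

    /** Guarded count: only excluded nodes still in the cluster map are subtracted. */
    public static int guardedCount(Set<String> clusterLeaves, String scope,
                                   Collection<String> excluded) {
        int excludedInScope = 0;
        for (String path : excluded) {
            if (clusterLeaves.contains(path) && inScope(path, scope)) {
                excludedInScope++;
            }
        }
        return leavesInScope(clusterLeaves, scope) - excludedInScope;
    }

    public static void main(String[] args) {
        // dn1 was already removed from the cluster map as dead; only dn2 is left.
        Set<String> cluster = new HashSet<>(Arrays.asList("/rack1/dn2"));
        // The client still passes the dead node dn1 in its excluded list.
        List<String> excluded = Arrays.asList("/rack1/dn1");
        List<String> none = Collections.emptyList();

        System.out.println(naiveCount(cluster, "/rack1", excluded));   // 0
        System.out.println(guardedCount(cluster, "/rack1", excluded)); // 1
        System.out.println(naiveCount(cluster, "/rack9", none));       // 1
        System.out.println(guardedCount(cluster, "/rack9", none));     // 0
    }
}
```

In this model the first pair of results corresponds to steps 1 to 6: the stale excluded node is subtracted although it is no longer in the cluster map, so the live node dn2 is not counted as available. The second pair corresponds to a scope for which no node is present.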

  was:
I got the error "File /hdfs_COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation." in the following case:

1. A cluster with 2 DNs.
2. One of the datanodes had not been responding for the last 10 minutes and was about to be detected as dead at the NN.
3. I tried to write one file; the NN allocated both DNs for the block.
4. While creating the pipeline, the client took some time to detect the failure of one node.
5. Before the client detected the pipeline failure, the dead node was removed from the cluster map on the NN side.
6. The client then abandoned the previous block and asked for a new block with the dead node in the excluded list, and got the above exception even though one more live node was available.

When I dug into this further, I found that {{NetWorkTopology#countNumOfAvailableNodes()}} does not return the correct count when the excludeNodes passed from the client are not part of the cluster map.



> NetWorkTopology#countNumOfAvailableNodes() is returning wrong value if excluded nodes passed are not part of the cluster tree
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-10131
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10131
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.0.5-alpha
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>         Attachments: HADOOP-10131.patch, HDFS-5112.patch
>
>
> I got the error "File /hdfs_COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation." in the following case:
> 1. A cluster with 2 DNs.
> 2. One of the datanodes had not been responding for the last 10 minutes and was about to be detected as dead at the NN.
> 3. I tried to write one file; the NN allocated both DNs for the block.
> 4. While creating the pipeline, the client took some time to detect the failure of one node.
> 5. Before the client detected the pipeline failure, the dead node was removed from the cluster map on the NN side.
> 6. The client then abandoned the previous block and asked for a new block with the dead node in the excluded list, and got the above exception even though one more live node was available.
> When I dug into this further, I found that {{NetWorkTopology#countNumOfAvailableNodes()}} does not return the correct count when the excludeNodes passed from the client are not part of the cluster map.
> One more case where the count is wrong:
> 1. There is no node present for the normalized scope in the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
