hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Xudong Cao (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-16677) Recalculate the remaining timeout millis correctly while throwing an InterupptedException in SocketIOWithTimeout.
Date Thu, 31 Oct 2019 08:27:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Xudong Cao updated HADOOP-16677:
--------------------------------
    Description: 
In SocketIOWithTimeout, when a thread was interrupted and exit from select(), it proceed to
throw an InterruptedIOException, in exception message the remaining timeout mills should
be recalcuated correctly rather than simply give a total timeout millis ,  otherwise it could
be very misleading.

For example, if an hdfs writer has not sent any packet to the pipeline more than 60s (e.g.
full gc or network issues),  then one of pipeline datanodes may be timeout and close its
sockets to other dns,  so its upstream DN's PacketResponder will immediately meet an EOF
and then interrupt its own DataXceiver, finally its DataXeiver will print some logs like:
{code:java}
2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24 09:22:58,212
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected
local=/10.196.146.114:9003 remote=/10.215.153.105:38559]. 60000 millis timeout left.
 at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
 at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
 ...
 
{code}
 

This log is very misleading because a 60000 mills timeout left implies that the DataXceiver
never blocks in select(), and this is unrealistic.  in fact, the truly timeout mills left
should be: 60000 - timeElapsedWhenSelect.

 

Finally, a properly log should be like this, the key improvment is "*Total timeout mills is
60000, 0 millis timeout left*":  
{code:java}
2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31 16:20:39,496 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected
local=/192.168.202.12:50010 remote=/192.168.202.11:57006]. Total timeout mills is 60000, 0
millis timeout left.
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
...
{code}

  was:
In SocketIOWithTimeout, when a thread was interrupted and exit from select(), it proceed to
throw an InterruptedIOException, in exception message the remaining timeout mills should
be recalcuated correctly rather than simply give a total timeout millis ,  otherwise it could
be very misleading.

For example, if an hdfs writer has not sent any packet to the pipeline more than 60s (e.g.
full gc or network issues),  then one of pipeline datanodes may be timeout and close its
sockets to other dns,  so its upstream DN's PacketResponder will immediately meet an EOF
and then interrupt its own DataXceiver, finally its DataXeiver will print some logs like:
{code:java}
2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24 09:22:58,212
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected
local=/10.196.146.114:9003 remote=/10.215.153.105:38559]. 60000 millis timeout left.
 at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
 at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
 ...
 
{code}
This log is very misleading because a 60000 mills timeout left implies that the DataXceiver
never blocks in select(), and this is unrealistic.  in fact, the truly timeout mills left
should be: 60000 - timeElapsedWhenSelect.

Finally, a properly log should be like this, the key improvment is "*Total timeout mills is
60000, 0 millis timeout left*":  
{code:java}
2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31 16:20:39,496 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected
local=/192.168.202.12:50010 remote=/192.168.202.11:57006]. Total timeout mills is 60000, 0
millis timeout left.
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
...
{code}


> Recalculate the remaining timeout millis correctly while throwing an InterupptedException
in SocketIOWithTimeout.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16677
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16677
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Minor
>
> In SocketIOWithTimeout, when a thread was interrupted and exit from select(), it proceed
to throw an InterruptedIOException, in exception message the remaining timeout mills should
be recalcuated correctly rather than simply give a total timeout millis ,  otherwise it could
be very misleading.
> For example, if an hdfs writer has not sent any packet to the pipeline more than 60s
(e.g. full gc or network issues),  then one of pipeline datanodes may be timeout and close
its sockets to other dns,  so its upstream DN's PacketResponder will immediately meet an
EOF and then interrupt its own DataXceiver, finally its DataXeiver will print some logs like:
> {code:java}
> 2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception
for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24 09:22:58,212
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected
local=/10.196.146.114:9003 remote=/10.215.153.105:38559]. 60000 millis timeout left.
>  at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
>  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>  ...
>  
> {code}
>  
> This log is very misleading because a 60000 mills timeout left implies that the DataXceiver
never blocks in select(), and this is unrealistic.  in fact, the truly timeout mills left
should be: 60000 - timeElapsedWhenSelect.
>  
> Finally, a properly log should be like this, the key improvment is "*Total timeout mills
is 60000, 0 millis timeout left*":  
> {code:java}
> 2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception
for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31 16:20:39,496
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected
local=/192.168.202.12:50010 remote=/192.168.202.11:57006]. Total timeout mills is 60000, 0
millis timeout left.
> at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message