tez-dev mailing list archives

From "Kurt Muehlner (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer
Date Thu, 24 Mar 2016 16:48:25 GMT
Kurt Muehlner created TEZ-3187:
----------------------------------

             Summary: Pig on tez hang with java.io.IOException: Connection reset by peer
                 Key: TEZ-3187
                 URL: https://issues.apache.org/jira/browse/TEZ-3187
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.8.2
         Environment: Hadoop 2.5.0
Pig 0.15.0
Tez 0.8.2
            Reporter: Kurt Muehlner


We are experiencing occasional application hangs when testing an existing Pig MapReduce script
executing on Tez.  When this occurs, we find the following in the syslog for the executing dag:

2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No
taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000822,
containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, heldContainers=112,
delayedContainers=27, isNew=false
2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No
taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000824,
containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, heldContainers=111,
delayedContainers=26, isNew=false
2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|: Socket Reader
#1 for port 53324: readAndProcess from client 10.102.173.86 threw exception [java.io.IOException:
Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No
taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000811,
containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, heldContainers=110,
delayedContainers=25, isNew=false
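For reference, the "Releasing container" messages above come from Tez's idle-container release logic, and the logged idleTimeout=5000 matches the default minimum idle timeout. A sketch of the relevant tez-site.xml settings, assuming the property names as defined in TezConfiguration (values shown are the documented defaults, not a recommended change):

```xml
<!-- tez-site.xml: idle container release window (defaults shown).
     A held container with no pending task requests is released once its
     idle delay expires, producing the log lines above. -->
<property>
  <name>tez.am.container.idle.release-timeout-min.millis</name>
  <value>5000</value>
</property>
<property>
  <name>tez.am.container.idle.release-timeout-max.millis</name>
  <value>10000</value>
</property>
```

These releases are normal scheduler behavior on their own; the point of interest here is that they coincide with the IOException below.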

In all cases I've been able to analyze so far, this also correlates with a warning on the
node identified in the IOException:

2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] |retry.RetryInvocationHandler|:
A failover has occurred since the start of this method invocation attempt.

However, it does not appear that any namenode failover has actually occurred (the most recent
failover we see in logs is from 2015).

Attached:
syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
10.102.173.86.logs.gz: aggregated logs from the host identified in the IOException



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
