hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <>
Subject [jira] [Commented] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion
Date Fri, 24 Apr 2015 19:42:38 GMT


Prasanth Jayachandran commented on HIVE-10480:

This spurious access$300 exception is due to the task reporter's callable (or some other
field) becoming null. This may happen because the task reporter has been shut down. The
reporter is shut down when the AM sends the shouldDie signal, and the AM sends shouldDie when
TezTaskRunner sends a failure notification. Some information seems to be missing (why did
TezTaskRunner send a failure?). Can you attach the entire log if it's not too big?
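
To illustrate the failure mode described above, here is a minimal sketch. It is not the actual
LlapTaskReporter code: the class ReporterShutdownSketch, its field currentCallable, and its
methods are hypothetical names chosen for illustration. It assumes the reporter drops its
reference to the callable on shutdown and that a later taskFailed-style call dereferences it.
With compilers older than JDK 11, an outer class reaches a private member of a nested class
through a synthetic access$NNN method, which is why the NPE frame would carry that name; newer
compilers would show the calling method instead.

```java
final class ReporterShutdownSketch {

    /** Hypothetical stand-in for a heartbeat callable nested inside a reporter. */
    private static final class HeartbeatCallable {
        private int failuresSent = 0;

        // Private member: compilers older than JDK 11 let the outer class reach
        // this through a synthetic access$NNN method.
        private boolean sendFailure(String diagnostics) {
            failuresSent++;
            return true;
        }
    }

    // Set to null by shutdown(), mimicking a reporter that has been shut down.
    private volatile HeartbeatCallable currentCallable = new HeartbeatCallable();

    public void shutdown() {
        currentCallable = null;
    }

    /** Unguarded: throws NullPointerException if called after shutdown(). */
    public boolean taskFailedUnguarded(String diagnostics) {
        return currentCallable.sendFailure(diagnostics);
    }

    /** Guarded: returns false instead of throwing once the reporter is shut down. */
    public boolean taskFailedGuarded(String diagnostics) {
        HeartbeatCallable callable = currentCallable; // read the field once
        if (callable == null) {
            return false;
        }
        return callable.sendFailure(diagnostics);
    }

    /** Returns true if the unguarded path throws an NPE after shutdown. */
    public static boolean unguardedThrowsAfterShutdown() {
        ReporterShutdownSketch reporter = new ReporterShutdownSketch();
        reporter.shutdown();
        try {
            reporter.taskFailedUnguarded("late failure");
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded throws NPE after shutdown: "
                + unguardedThrowsAfterShutdown());
        ReporterShutdownSketch reporter = new ReporterShutdownSketch();
        reporter.shutdown();
        System.out.println("guarded call reported: "
                + reporter.taskFailedGuarded("late failure"));
    }
}
```

The guarded variant only shows one way to make a late failure report harmless. Whether the real
reporter should ignore, log, or queue such a report is a separate question that the full log
would help answer.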

> LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails
to report completion
> -----------------------------------------------------------------------------------------------------------
>                 Key: HIVE-10480
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
> No idea if this is an LLAP bug, a Tez bug, a Hadoop IPC bug (due to a patch on the cluster), or
all 3.
> So for now I will just dump all I have here.
> TPCH Q1 started taking a long time for me on a large number of runs today (this didn't
happen yesterday). It would always be one Map task timing out.
> Example attempt (logs from the AM):
> {noformat}
> 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] tezplugins.LlapTaskCommunicator:
Successfully launched task: attempt_1429683757595_0321_9_00_000928_0
> 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] history.HistoryEventHandler:
[HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0,
startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR,
diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out after 300 secs, counters=Counters:
1, org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
> {noformat}
> No other lines for this attempt in between.
> However there's this:
> {noformat}
> 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: Unable to
read call parameters for client connection protocol org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol
for rpcKind RPC_WRITABLE
> java.lang.ArrayIndexOutOfBoundsException
> 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: Socket Reader
#1 for port 59446: readAndProcess from client threw exception [org.apache.hadoop.ipc.RpcServerException:
IPC server unable to read call parameters: null]
> {noformat}
> On LLAP, the following is logged:
> {noformat}
> 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR org.apache.tez.runtime.task.TezTaskRunner:
TaskReporter reported error
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC
server unable to read call parameters: null
>         at
>         at
>         at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(
>         at com.sun.proxy.$Proxy19.heartbeat(Unknown Source)
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$
>         at
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
>         at java.util.concurrent.ThreadPoolExecutor$
>         at
> {noformat}
> The attempt starts but is then interrupted (not clear by whom)
> {noformat}
> 2015-04-24 11:11:01,144 [Initializer 0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Initialized Input with
src edge: lineitem
> 2015-04-24 11:11:01,145 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error while executing
task: attempt_1429683757595_0321_9_00_000928_0
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(
>         at java.util.concurrent.LinkedBlockingQueue.take(
>         at java.util.concurrent.ExecutorCompletionService.take(
>         at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$
> {noformat}
> There's a heartbeat error that may or may not be spurious:
> {noformat}
> 2015-04-24 11:11:01,146 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Ignoring the following exception
since a previous exception is already registered
> java.lang.NullPointerException
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.access$300(
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter.taskFailed(
>         at org.apache.tez.runtime.task.TezTaskRunner.sendFailure(
>         at org.apache.tez.runtime.task.TezTaskRunner.access$600(
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$
>         at Method)
>         at
>         at
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(
> {noformat}
> And nothing more for this task.
> Task succeeds on rerun. Other tasks also succeed on this node meanwhile.
> 1) Is it a Hadoop IPC bug due to the new patch?
> 2) Even if so (assuming IPC is not totally broken), I wonder if the heartbeat NPE (which
we have seen before, without the IPC patch too IIRC) is actually a real issue that prevents
the task from being sent to the AM?
> 3) Who interrupts the task and why? The AM doesn't have any logs about that, and it happens
after the RPC error, not before.
> Btw, there's another flavor of RPC error:
> {noformat}
> 2015-04-24 10:36:30,183 INFO [Socket Reader #1 for port 59446] ipc.Server: Socket Reader
#1 for port 59446: readAndProcess from client threw exception [org.apache.hadoop.ipc.RpcServerException:
IPC server unable to read call parameters: 1382376565]
> {noformat}
> application_1429683757595_0320, application_1429683757595_0321
