tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kostas Tzoumas <ktzou...@apache.org>
Subject Re: Problem with large job on Flink-on-Tez application
Date Fri, 09 Jan 2015 09:39:20 GMT
Yes, I am using a custom input initializer, that might be the problem
indeed. Thanks!

On Thu, Jan 8, 2015 at 11:03 PM, Bikas Saha <bikas@hortonworks.com> wrote:

> Its likely that its hitting an NPE on this line of code
>
> Task targetTask = vertex.getTask(riEvent.getTargetIndex());
> targetTask.registerTezEvent(tezEvent); <<<<<<<<<<<<
this line
>
> So the error could be that you are generating X splits but less then X
> tasks. So at some point there is a split event for a non-existent task. Are
> you using any custom input initializer (or vertex manager)?
>
> Bikas
>
> -----Original Message-----
> From: Kostas Tzoumas [mailto:ktzoumas@apache.org]
> Sent: Thursday, January 08, 2015 1:41 PM
> To: dev@tez.apache.org
> Subject: Problem with large job on Flink-on-Tez application
>
> Hi folks,
>
> I am running a SQL-like query (similar to TPC-H Q3) on a prototype of
> Flink-on-Tez. I am using the latest Tez master. It runs fine on smaller
> scales (e.g., TPC-H scale 500), and I am getting the following error at
> scale 1250:
>
> 15/01/08 22:07:42 INFO client.DAGClientImpl: DAG initialized:
> CurrentState=Running
> 15/01/08 22:07:43 INFO client.DAGClientImpl: DAG: State: RUNNING Progress:
> 0% TotalTasks: 12092 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
> 15/01/08 22:07:47 INFO client.DAGClientImpl: DAG: State: FAILED Progress:
> 0% TotalTasks: 12092 Succeeded: 0 Running: 0 Failed: 0 Killed: 1246
> 15/01/08 22:07:47 INFO client.DAGClientImpl: DAG completed.
> FinalState=FAILED
> 15/01/08 22:07:47 ERROR client.TezExecutor: Flink Java Job at Thu Jan 08
> 22:07:33 CET 2015 failed with diagnostics: [Vertex failed,
> vertexName=DataSource (at getCustomerDataSet(TPCHQuery3.java:217)
>
> (org.apache.flink.api.java.io.CsvInputFormat))032ae46b-6bb9-4dcf-91c4-7e49b49f6546,
> vertexId=vertex_1420727594991_0021_1_01, diagnostics=[Exception in
> VertexManager, vertex=vertex_1420727594991_0021_1_01 [DataSource (at
> getCustomerDataSet(TPCHQuery3.java:217)
>
> (org.apache.flink.api.java.io.CsvInputFormat))032ae46b-6bb9-4dcf-91c4-7e49b49f6546],java.lang.NullPointerException
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl.handleRoutedTezEvents(VertexImpl.java:3857)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1265)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleVertexTasks(VertexManager.java:144)
> at
>
> org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.scheduleTasks(ImmediateStartVertexManager.java:100)
> at
>
> org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.onVertexStarted(ImmediateStartVertexManager.java:75)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:365)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:3320)
> at
> org.apache.tez.dag.app.dag.impl.VertexImpl.access$5100(VertexImpl.java:183)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3293)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3285)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1587)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:182)
> at
>
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1763)
> at
>
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1749)
> at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
> at java.lang.Thread.run(Thread.java:745)
>
>
> It may well be that this is a problem on the Flink side of things (albeit
> not visible in the stack trace), I was just wondering if the error rings a
> bell. Since I am creating an input task per input split, this job results
> in
> quite a few of input tasks.
>
> The logs contain one more exception [1] caused by the exception above.
>
> Best,
> Kostas
>
> [1]
>
> 2015-01-08 22:07:44,448 ERROR [Dispatcher thread: Central] impl.VertexImpl:
> Exception in VertexManager, vertex=vertex_1420727594991_0021_1_01
> [DataSource (at getCustomerDataSet(TPCHQuery3.java:217)
>
> (org.apache.flink.api.java.io.CsvInputFormat))032ae46b-6bb9-4dcf-91c4-7e49b49f6546]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException:
> java.lang.NullPointerException
> at
>
> org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:368)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:3320)
> at
> org.apache.tez.dag.app.dag.impl.VertexImpl.access$5100(VertexImpl.java:183)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3293)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3285)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
>
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1587)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:182)
> at
>
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1763)
> at
>
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1749)
> at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl.handleRoutedTezEvents(VertexImpl.java:3857)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1265)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleVertexTasks(VertexManager.java:144)
> at
>
> org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.scheduleTasks(ImmediateStartVertexManager.java:100)
> at
>
> org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.onVertexStarted(ImmediateStartVertexManager.java:75)
> at
>
> org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:365)
> ... 16 more
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message