flink-issues mailing list archives

From "Robert Metzger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1546) Failed job causes JobManager to shutdown due to uncatched WebFrontend exception
Date Sat, 14 Feb 2015 19:27:11 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321661#comment-14321661 ]

Robert Metzger commented on FLINK-1546:
---------------------------------------

Indeed. The uncaught exception doesn't cause the JM to die anymore.

Now I see the following output in the logs:
{code}
20:21:48,968 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Received job 49a866e90bce097d9ebb7f2caee0b103 (Read only job).
20:21:49,145 ERROR org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Job submission failed.
org.apache.flink.runtime.JobException: Creating the input splits caused an error: File does not exist: hdfs:/user/robert/datasets/access-100.log
	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:161)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:194)
	at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:460)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:171)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.yarn.YarnJobManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnJobManager.scala:70)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:86)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs:/user/robert/datasets/access-100.log
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
	at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:339)
	at org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:403)
	at org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:51)
	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:145)
	... 23 more
20:21:49,151 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-34.dima.tu-berlin.de:54138/user/taskmanager as 75fc90247a92e285f4cb45a7028b6fbd. Current number of registered hosts is 10.
20:21:49,517 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-36.dima.tu-berlin.de:54285/user/taskmanager as c63f5e6425cf95175d37bea8d6be35fa. Current number of registered hosts is 11.
20:21:49,635 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-18.dima.tu-berlin.de:40967/user/taskmanager as d6937096c3dc5989a688e5c23c38853c. Current number of registered hosts is 12.
20:21:49,814 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-19.dima.tu-berlin.de:49978/user/taskmanager as 6c7be9dc4480c1acde14df555ae6d472. Current number of registered hosts is 13.
20:21:50,216 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-33.dima.tu-berlin.de:55551/user/taskmanager as 9a630d28618a6b0938943d241a1da607. Current number of registered hosts is 14.
20:21:50,952 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-24.dima.tu-berlin.de:50970/user/taskmanager as 236fbc6d5e1e7938592a8c14b74bfc55. Current number of registered hosts is 15.
20:21:52,279 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-22.dima.tu-berlin.de:49420/user/taskmanager as ed1145f809f7e171af437a38f0fc2157. Current number of registered hosts is 16.
20:21:52,482 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-28.dima.tu-berlin.de:49520/user/taskmanager as 33b02419e0065b63280dd9092aecb5a3. Current number of registered hosts is 17.
20:21:52,569 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-30.dima.tu-berlin.de:51448/user/taskmanager as cdb89309facb0b449d2dfff21e373a96. Current number of registered hosts is 18.
20:21:54,849 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at akka.tcp://flink@cloud-35.dima.tu-berlin.de:56979/user/taskmanager as 4c1e2109390d3929abd55311c6188bd0. Current number of registered hosts is 19.
20:21:59,160 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Status of job 49a866e90bce097d9ebb7f2caee0b103 (Read only job) changed to FAILEDCleanup job 49a866e90bce097d9ebb7f2caee0b103..
20:21:59,166 ERROR org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Could not prepare the execution graph org.apache.flink.runtime.executiongraph.ExecutionGraph@33d028af for archiving.
java.lang.IllegalStateException: Can only archive the job from a terminal state
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.prepareForArchiving(ExecutionGraph.java:648)
	at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$removeJob(JobManager.scala:543)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:292)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.yarn.YarnJobManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnJobManager.scala:70)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:86)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
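The difference between this log and the one in the issue description is where the IllegalStateException ends up: previously it was reported by akka.actor.OneForOneStrategy and the job manager was stopped, now it is logged as "Could not prepare the execution graph ... for archiving." and the JobManager keeps running. A minimal sketch of that containment pattern follows. It is not the actual Flink code; every class, field, and method name in it (ArchiveSketch, Graph, JobStatus, removeJob, log) is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are Flink classes. */
final class ArchiveSketch {

    /** Simplified job states; only some of them are terminal. */
    enum JobStatus {
        CREATED(false), RUNNING(false), FAILING(false),
        FAILED(true), CANCELED(true), FINISHED(true);

        final boolean terminal;

        JobStatus(boolean terminal) {
            this.terminal = terminal;
        }
    }

    /** Stand-in for an execution graph that refuses to be archived too early. */
    static final class Graph {
        JobStatus status = JobStatus.CREATED;

        void prepareForArchiving() {
            if (!status.terminal) {
                throw new IllegalStateException(
                        "Can only archive the job from a terminal state");
            }
        }
    }

    /** Collects log lines so the example needs no logging framework. */
    static final List<String> log = new ArrayList<>();

    /**
     * Cleanup handler that contains the failure: the exception is logged
     * and the caller keeps running instead of being torn down.
     * Returns true only if the graph could be prepared for archiving.
     */
    static boolean removeJob(Graph graph) {
        try {
            graph.prepareForArchiving();
            return true;
        } catch (IllegalStateException e) {
            log.add("ERROR Could not prepare the execution graph for archiving: "
                    + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        Graph stuck = new Graph();              // never left CREATED
        System.out.println(removeJob(stuck));   // prints false, nothing is thrown

        Graph failed = new Graph();
        failed.status = JobStatus.FAILED;       // terminal state
        System.out.println(removeJob(failed));  // prints true

        log.forEach(System.out::println);       // one ERROR line, from the first call
    }
}
```

Catching the exception only keeps the process alive. As the log above shows, the archiving attempt itself still fails when the graph is not in a terminal state.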

> Failed job causes JobManager to shutdown due to uncatched WebFrontend exception
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-1546
>                 URL: https://issues.apache.org/jira/browse/FLINK-1546
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 0.9
>            Reporter: Robert Metzger
>
> {code}
> 16:59:26,588 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Status of job ef19b2b201d4b81f031334cb76eadc78 (Basic Page Rank Example) changed to FAILEDCleanup job ef19b2b201d4b81f031334cb76eadc78..
> 16:59:26,591 ERROR akka.actor.OneForOneStrategy                                  - Can only archive the job from a terminal state
> java.lang.IllegalStateException: Can only archive the job from a terminal state
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.prepareForArchiving(ExecutionGraph.java:648)
> 	at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$removeJob(JobManager.scala:508)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:271)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> 	at org.apache.flink.yarn.YarnJobManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnJobManager.scala:70)
> 	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
> 	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:86)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 16:59:26,595 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Stopping webserver.
> 16:59:26,654 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Stopped webserver.
> 16:59:26,656 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Stopping job manager akka://flink/user/jobmanager.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
