spark-issues mailing list archives

From "Saisai Shao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17340) .sparkStaging not cleaned if application exited incorrectly
Date Thu, 01 Sep 2016 08:40:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454777#comment-15454777 ]

Saisai Shao commented on SPARK-17340:
-------------------------------------

I think in your scenario it is because you deliberately killed the local {{yarn-client}} process with Ctrl-C,
which means it loses the ability to monitor the application and to delete the staging dir. If
you check {{Client.scala#monitorApplication}}, you will find:

{code}
      if (state == YarnApplicationState.FINISHED ||
        state == YarnApplicationState.FAILED ||
        state == YarnApplicationState.KILLED) {
        cleanupStagingDir(appId)
        return (state, report.getFinalApplicationStatus)
      }
{code}
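
The {{cleanupStagingDir(appId)}} call above is what removes the directory; roughly, it amounts to a recursive delete of {{.sparkStaging/<appId>}} under the submitting user's home dir. A minimal sketch, assuming the standard Hadoop FileSystem API (names and logging are illustrative; the real code also honors {{spark.yarn.preserve.staging.files}}):

{code}
// Minimal sketch only, not the actual Spark implementation.
import java.io.IOException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def cleanupStagingDir(appId: String, hadoopConf: Configuration): Unit = {
  // The staging dir lives under the submitting user's home dir: .sparkStaging/<appId>
  val fs = FileSystem.get(hadoopConf)
  val stagingDirPath = new Path(fs.getHomeDirectory, s".sparkStaging/$appId")
  try {
    if (fs.delete(stagingDirPath, true)) {  // recursive delete
      println(s"Deleted staging directory $stagingDirPath")
    }
  } catch {
    case e: IOException =>
      println(s"Failed to clean up staging dir $stagingDirPath: ${e.getMessage}")
  }
}
{code}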

Even if the application is killed by the yarn command, the staging dir can still be cleaned up. But since
you killed this process, no one is monitoring the AM now.
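
For context, a simplified sketch of what that client-side monitoring loop looks like (not the exact Spark code; the poll interval and comments are illustrative):

{code}
// Simplified sketch of the client-side monitoring loop, for context only.
// cleanupStagingDir is invoked from this local process, so killing the client
// with Ctrl-C means the terminal state is never observed and the staging dir
// is never deleted.
import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient

def monitorApplication(yarnClient: YarnClient, appId: ApplicationId): YarnApplicationState = {
  while (true) {
    Thread.sleep(1000)  // illustrative poll interval; the real one is configurable
    val state = yarnClient.getApplicationReport(appId).getYarnApplicationState
    if (state == YarnApplicationState.FINISHED ||
        state == YarnApplicationState.FAILED ||
        state == YarnApplicationState.KILLED) {
      // cleanupStagingDir(appId)  -- the call that is skipped once this JVM is gone
      return state
    }
  }
  sys.error("unreachable")
}
{code}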

So according to my test, this staging dir will normally be deleted, no matter whether the application
finishes with success, failure, or kill. Your case is a tricky one, I think, and I'm inclined
not to fix such an issue, because normally people will not use it like this.

> .sparkStaging not cleaned if application exited incorrectly
> -----------------------------------------------------------
>
>                 Key: SPARK-17340
>                 URL: https://issues.apache.org/jira/browse/SPARK-17340
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.5.2, 1.6.1, 2.0.0
>            Reporter: Artur Sukhenko
>            Priority: Minor
>
> When running Spark (YARN, cluster mode) and killing the application,
> .sparkStaging is not cleaned up.
> Reproduce:
> 1. Run the SparkPi job in yarn-cluster mode
> 2. Wait for the app to switch to RUNNING and press Ctrl+C
> 3. Kill the app: $ yarn application -kill <app id>
> 4. Verify that it is not cleaned up: $ hadoop fs -ls .sparkStaging (see the sketch after the listing below)
> Tested in Spark 1.5.2, 1.6.1, 2.0.0 - same result.
> All of these apps are already finished/killed, but sparkStaging/application_<id> remains:
> {code}
> $ hadoop fs -ls .sparkStaging
> Found 6 items
> drwx------   - user user          3 2016-08-26 00:57 .sparkStaging/application_1472140614688_0001
> drwx------   - user user          3 2016-08-26 01:09 .sparkStaging/application_1472140614688_0002
> drwx------   - user user          3 2016-08-26 19:51 .sparkStaging/application_1472140614688_0005
> drwx------   - user user          3 2016-08-26 19:53 .sparkStaging/application_1472140614688_0007
> drwx------   - user user          3 2016-08-31 22:43 .sparkStaging/application_1472634296300_0011
> drwx------   - user user          3 2016-08-31 23:30 .sparkStaging/application_1472651370711_0006
> {code}
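> For reference, the same check done through the Hadoop FileSystem API; a minimal illustrative sketch, equivalent to the hadoop fs -ls above:
> {code}
> // Illustrative sketch: list whatever is left under ~/.sparkStaging, like `hadoop fs -ls`.
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> object ListStagingDirs {
>   def main(args: Array[String]): Unit = {
>     val fs = FileSystem.get(new Configuration())
>     val stagingRoot = new Path(fs.getHomeDirectory, ".sparkStaging")
>     if (fs.exists(stagingRoot)) {
>       fs.listStatus(stagingRoot).foreach(s => println(s"${s.getModificationTime}  ${s.getPath}"))
>     }
>   }
> }
> {code}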
> {code}
> ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/08/26 00:51:17 INFO spark.SparkContext: Invoking stop() from shutdown hook
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 506.0 in stage 0.0 (TID 506, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 507.0 in stage 0.0 (TID 507, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 504.0 in stage 0.0 (TID 504) in 14 ms on node1 (505/1000)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 505.0 in stage 0.0 (TID 505) in 14 ms on node1 (506/1000)
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
> 16/08/26 00:51:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 508.0 in stage 0.0 (TID 508, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 507.0 in stage 0.0 (TID 507) in 20 ms on node1 (507/1000)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 509.0 in stage 0.0 (TID 509, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 506.0 in stage 0.0 (TID 506) in 38 ms on node1 (508/1000)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 510.0 in stage 0.0 (TID 510, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 508.0 in stage 0.0 (TID 508) in 35 ms on node1 (509/1000)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 511.0 in stage 0.0 (TID 511, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 510.0 in stage 0.0 (TID 510) in 11 ms on node1 (510/1000)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 512.0 in stage 0.0 (TID 512, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 509.0 in stage 0.0 (TID 509) in 25 ms on node1 (511/1000)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 513.0 in stage 0.0 (TID 513, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 511.0 in stage 0.0 (TID 511) in 22 ms on node1 (512/1000)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 514.0 in stage 0.0 (TID 514, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 512.0 in stage 0.0 (TID 512) in 26 ms on node1 (513/1000)
> 16/08/26 00:51:17 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.33.11:41489
> 16/08/26 00:51:17 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Starting task 515.0 in stage 0.0 (TID 515, node1, PROCESS_LOCAL, 2085 bytes)
> 16/08/26 00:51:17 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:36) failed in 7.827 s
> 16/08/26 00:51:17 INFO scheduler.DAGScheduler: Job 0 failed: reduce at SparkPi.scala:36, took 8.039526 s
> 16/08/26 00:51:17 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
> 16/08/26 00:51:17 INFO scheduler.TaskSetManager: Finished task 514.0 in stage 0.0 (TID 514) in 21 ms on node1 (514/1000)
> 16/08/26 00:51:17 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
> org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:703)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
> 	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> 	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:702)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1514)
> 	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
> 	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1438)
> 	at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1724)
> 	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
> 	at org.apache.spark.SparkContext.stop(SparkContext.scala:1723)
> 	at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:587)
> 	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> 	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> 	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> 	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> 	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> 	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> 	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> 	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> 	at scala.util.Try$.apply(Try.scala:161)
> 	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> 	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> 	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> 	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
> 	at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1007)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> 	at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
> 	at org.apache.spark.rdd.RDD.reduce(RDD.scala:989)
> 	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:36)
> 	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
> 16/08/26 00:51:17 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down)
> 16/08/26 00:51:17 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
> End of LogType:stderr
> {code}
> Currently cleanupStagingDir is run only when the app has final status SUCCEEDED (or it is the last attempt):
> {code:title=ApplicationMaster.scala}
>   if (!unregistered) {
>     // we only want to unregister if we don't want the RM to retry
>     if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
>       unregister(finalStatus, finalMsg)
>       cleanupStagingDir(fs)
>     }
>   }
> {code}
> If we kill spark-submit in the terminal with Ctrl+C, the cleanup is not performed.
> Can we do the unregister and cleanup (as sketched below) if we get, after a yarn kill,
> exitCode 15 in Spark 1.5.X/1.6.X:
> {code}
> ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down)
> {code}
> and exitCode 16 in Spark 2.0.X:
> {code}
> ERROR ApplicationMaster: RECEIVED SIGNAL TERM
> INFO ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
> {code}
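> A hypothetical sketch of that suggestion (not an actual patch; the constant names are made up, and the AM's final exit code is assumed to be in scope as exitCode), extending the condition quoted above:
> {code}
> // Hypothetical sketch only: also clean up when the AM is going down due to the
> // SIGTERM-driven failure paths. Constant names are illustrative, and the final
> // exit code is assumed to be available as exitCode in this scope.
> val EXIT_SIGNAL_15 = 15        // Spark 1.5.x/1.6.x: "RECEIVED SIGNAL 15: SIGTERM"
> val EXIT_EARLY_SHUTDOWN = 16   // Spark 2.0.x: shutdown hook before final status was reported
>
> if (!unregistered) {
>   val killedBySignal = finalStatus == FinalApplicationStatus.FAILED &&
>     (exitCode == EXIT_SIGNAL_15 || exitCode == EXIT_EARLY_SHUTDOWN)
>   // we only want to unregister if we don't want the RM to retry
>   if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt || killedBySignal) {
>     unregister(finalStatus, finalMsg)
>     cleanupStagingDir(fs)
>   }
> }
> {code}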



