spark-user mailing list archives

From: Kevin Markey <kevin.mar...@oracle.com>
Subject: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory
Date: Wed, 21 May 2014 23:03:06 GMT

I tested an application on RC10 and Hadoop 2.3.0 in yarn-cluster mode
that had run successfully with Spark 0.9.1 and Hadoop 2.3 or 2.2. The
application ran to conclusion, but it was ultimately reported as failed.

There were two anomalies...

1. ASM reported only that the application was "ACCEPTED". It never
indicated that the application was "RUNNING".

    14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:
         application identifier: application_1400696988985_0007
         appId: 7
         clientToAMToken: null
         appDiagnostics:
         appMasterHost: N/A
         appQueue: default
         appMasterRpcPort: -1
         appStartTime: 1400709970857
         yarnAppState: ACCEPTED
         distributedFinalState: UNDEFINED
         appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/
         appUser: hduser

Furthermore, it started a second container, running two partly
overlapping drivers, when it appeared that the application never
started. Each container ran to conclusion as explained above, taking
twice as long as usual for both to complete. Both instances had the
same concluding failure. (A sketch for double-checking the YARN retry
setting behind this follows below.)
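
For what it's worth, my reading (which may be wrong) is that the second
container is YARN retrying the ApplicationMaster attempt: the
ResourceManager relaunches a failed attempt up to
yarn.resourcemanager.am.max-attempts times, and the Hadoop 2.3 default
is 2. Here is a minimal Scala sketch for checking the attempt cap and
asking the RM directly for the application state; the application id is
the one from this run, everything else is illustrative only:

    import org.apache.hadoop.yarn.api.records.ApplicationId
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    object CheckApp {
      def main(args: Array[String]): Unit = {
        val conf = new YarnConfiguration()

        // Total AM attempts the RM allows; the default of 2 would
        // account for two overlapping containers if the first attempt
        // was judged failed.
        val maxAttempts = conf.getInt(
          YarnConfiguration.RM_AM_MAX_ATTEMPTS,
          YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS)
        println(s"yarn.resourcemanager.am.max-attempts = $maxAttempts")

        // Ask the RM for the report directly, independent of the
        // Spark client's own polling loop.
        val yarn = YarnClient.createYarnClient()
        yarn.init(conf)
        yarn.start()
        val appId  = ApplicationId.newInstance(1400696988985L, 7)
        val report = yarn.getApplicationReport(appId)
        println(s"state = ${report.getYarnApplicationState}")
        println(s"finalStatus = ${report.getFinalApplicationStatus}")
        yarn.stop()
      }
    }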

2. Each instance failed, as indicated by the stderr log, finding that
the filesystem was closed when trying to clean up the staging directory.

14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
14/05/21 16:08:24 INFO Executor: Finished task ID 1453
14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2)
14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool
14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s
14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s
14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250
14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler
14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted!
14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped
14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared
14/05/21 16:08:25 INFO BlockManager: BlockManager stopped
14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped
14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED
14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal.
14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1400696988985_0007
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/05/21 16:08:25 ERROR ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_1400696988985_0007
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
    at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371)
    at org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
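
My suspicion (and it is only a suspicion) is the usual Hadoop FileSystem
cache hazard: FileSystem.get() hands every caller the same cached
instance per scheme, authority, and user, so if the application or a
competing shutdown hook closes that instance first, the AM's cleanup
hook finds the filesystem already closed. A minimal sketch of the
sharing behavior and the two usual workarounds (the URI is a
placeholder):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem

    object FsCacheSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val uri  = URI.create("hdfs://namenode:8020/") // placeholder

        // FileSystem.get returns one shared, cached instance, so
        // closing it anywhere closes it everywhere.
        val a = FileSystem.get(uri, conf)
        val b = FileSystem.get(uri, conf)
        println(a eq b) // true: same object
        a.close()
        // Using b now throws java.io.IOException: Filesystem closed.

        // Workaround 1: newInstance bypasses the cache; the caller
        // owns (and must close) a private copy.
        val mine = FileSystem.newInstance(uri, conf)
        mine.close()

        // Workaround 2: disable caching for the hdfs scheme.
        conf.setBoolean("fs.hdfs.impl.disable.cache", true)
        val fresh = FileSystem.get(uri, conf)
        println(fresh eq b) // false: no longer shared
        fresh.close()
      }
    }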

There is nothing about the staging directory itself that looks
suspicious...

drwx------   - hduser supergroup          0 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007
-rw-r--r--   3 hduser supergroup   92881278 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/app.jar
-rw-r--r--   3 hduser supergroup  118900783 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/spark-assembly-1.0.0-hadoop2.3.0.jar

Just prior to the staging directory cleanup, the application concluded
by writing results to 3 HDFS files. That occurred without incident.

This particular test was run using...

1. RC10 compiled as follows: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
2. Ran in yarn-cluster mode using spark-submit (an invocation of the
general shape used is shown below)
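
The submit command was of this general shape (the class name, jar, and
resource sizes are placeholders, not my actual values):

    spark-submit --master yarn-cluster --class com.example.KEval \
        --num-executors 4 --executor-memory 2g app.jar <args>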

Is there any configuration new to 1.0.0 that I might be missing? I
walked through all the changes on the YARN deployment page, updating my
scripts and configuration accordingly, and everything runs except for
these two anomalies.

Thanks
Kevin Markey
