spark-user mailing list archives

From Kevin Markey <kevin.mar...@oracle.com>
Subject Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory
Date Thu, 22 May 2014 07:20:32 GMT

Update: Partly user error. But still getting the FS closed error.

Yes, we are running plain vanilla Hadoop 2.3.0. But it probably doesn't matter...

1. Tried Colin McCabe's suggestion to patch with pull 850
(https://issues.apache.org/jira/browse/SPARK-1898). No effect.

2. While testing Colin's patch, I realized that the master is set in two places. In
spark-submit it is set using "--master yarn-cluster". But it is also set via the command
line in my application (for SparkContext initialization), which I did not modify
specifically for testing 1.0. When modifying my scripts to use --master, I removed the
user option --sm, which had the effect of setting the master to "local" -- a "local"
master running inside two YARN containers! (This is almost too funny to imagine!)
(Each of the two container logs is identical, looking like a local-mode log.)
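
For reference, a minimal sketch of the pattern I'm moving to, where the master comes
only from spark-submit and the SparkContext does not hard-code one. The object name,
app name, and lack of option parsing are purely illustrative, not our actual code:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        // No setMaster() here: "spark-submit --master yarn-cluster" supplies the
        // master, so there is no second, conflicting setting inside the application.
        val conf = new SparkConf().setAppName("MyApp")
        val sc = new SparkContext(conf)
        try {
          // ... application logic ...
        } finally {
          sc.stop()
        }
      }
    }

Submitted with something like:
spark-submit --master yarn-cluster --class MyApp app.jar <args>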

3. I corrected this option to also specify "yarn-cluster". Now ASM reports that the
application is RUNNING...

14/05/22 00:09:08 INFO yarn.Client: Application report from ASM:
     application identifier: application_1400738053238_0002
     appId: 2
     clientToAMToken: null
     appDiagnostics:
     appMasterHost: 192.168.1.21
     appQueue: default
     appMasterRpcPort: 0
     appStartTime: 1400738864762
     yarnAppState: RUNNING
     distributedFinalState: UNDEFINED
     appTrackingUrl: http://Sleepycat:8088/proxy/application_1400738053238_0002/
     appUser: hduser

And it is reported as a SUCCESS, despite a "Filesystem closed" IOException when the AM
attempts to clean up the staging directory.

(Now the two container logs are different. One is that of a driver, the other that of
an executor.)

4. But there is still the "Filesystem closed" error. This is reported only in the
driver's/AM's stderr log. It does not affect the final success of the job.

I'm not sure whether my app's opening and closing of the FS has any bearing. (It never
has before.) Before the application concludes, it saves several text files to HDFS by
using Hadoop utilities to (1) get an instance of the filesystem, (2) open an output
stream to a path in HDFS, (3) write to it, (4) close the stream, and (5) close the
filesystem -- for each file to be written! But this is done by the driver thread, not
by the application master. (Does the FS object owned by the ApplicationMaster interact
with an arbitrary FS instance in the driver?) Furthermore, it opens and closes the
filesystem anywhere from 3 times to hundreds of times, and it has done so in the past
without side effects.

We've had dozens of arguments about whether to close such FS instances. Sometimes we
have problems if we close them. Sometimes when we don't!
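
One thing I plan to try when I experiment: obtaining a private, non-cached FileSystem
for these per-file writes, so that closing it cannot invalidate the shared cached
instance that Spark's ApplicationMaster uses to delete .sparkStaging. A rough sketch
only, with a hypothetical helper and path, not our actual code:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Write a small text file to HDFS without touching the shared, cached FileSystem.
    // FileSystem.newInstance() returns a private instance, so close() here does not
    // close the cached instance that other threads (or the AM's shutdown hook) use.
    def writeTextFile(conf: Configuration, pathStr: String, text: String): Unit = {
      val path = new Path(pathStr)
      val fs = FileSystem.newInstance(path.toUri, conf)
      try {
        val out = fs.create(path, true) // overwrite if the file already exists
        try {
          out.write(text.getBytes("UTF-8"))
        } finally {
          out.close()
        }
      } finally {
        fs.close() // safe: only this private instance is closed
      }
    }

The other direction would be to stop calling close() on the instance returned by
FileSystem.get() altogether, since that is the shared cached one, and let Hadoop's own
shutdown hook clean it up.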

I shall experiment with the FileSystem handling when it's convenient.

5. Finally, I don't know if pull 850 had any effect. I've not rolled it back to retest
with the correct SparkContext master setting.

Thanks for your feedback.

Kevin

On 05/21/2014 11:54 PM, Tathagata Das wrote:

Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 or HDP? We may
be able to reproduce this in that case.

TD

On Wed, May 21, 2014 at 8:35 PM, Tom Graves <tgraves_cs@yahoo.com> wrote:

It sounds like something is closing the HDFS filesystem before everyone is really done
with it. The filesystem gets cached and is shared, so if someone closes it while other
threads are still using it, you run into this error. Is your application closing the
filesystem? Are you using the event logging feature? Could you share the options you
are running with?

YARN will retry the application depending on how the Application Master attempt fails
(the number of retries is a configurable setting). That is probably the second driver
you are referring to, but the attempts shouldn't have overlapped such that both were up
at the same time. Is that what you are seeing? Generally you want to look at why the
first application attempt fails.

Tom

On Wednesday, May 21, 2014 6:10 PM, Kevin Markey <kevin.markey@oracle.com> wrote:

I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run
successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2. The application ran successfully
to conclusion, but it ultimately failed.

There were 2 anomalies...

1. ASM reported only that the application was "ACCEPTED". It never indicated that the
application was "RUNNING".

14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:
     application identifier: application_1400696988985_0007
     appId: 7
     clientToAMToken: null
     appDiagnostics:
     appMasterHost: N/A
     appQueue: default
     appMasterRpcPort: -1
     appStartTime: 1400709970857
     yarnAppState: ACCEPTED
     distributedFinalState: UNDEFINED
     appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/
     appUser: hduser

Furthermore, it started a second container, running two partly overlapping drivers,
even though the application appeared never to have started. Each container ran to
conclusion as explained above, taking twice as long as usual for both to complete.
Both instances had the same concluding failure.

2. Each instance failed, as indicated by the stderr log: it found that the filesystem
was closed when trying to clean up the staging directories.

14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
14/05/21 16:08:24 INFO Executor: Finished task ID 1453
14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2)
14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool
14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s
14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s
14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250
14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler
14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted!
14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped
14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared
14/05/21 16:08:25 INFO BlockManager: BlockManager stopped
14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped
14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED
14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal.
14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1400696988985_0007
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/05/21 16:08:25 ERROR ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_1400696988985_0007
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
    at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591)
    at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371)
    at org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

There is nothing about the staging directories themselves that looks suspicious...

drwx------   - hduser supergroup          0 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007
-rw-r--r--   3 hduser supergroup   92881278 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/app.jar
-rw-r--r--   3 hduser supergroup  118900783 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/spark-assembly-1.0.0-hadoop2.3.0.jar

Just prior to the staging directory cleanup, the application concluded by writing
results to 3 HDFS files. That occurred without incident.

This particular test was run using...

1. RC10 compiled as follows: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
2. Ran in yarn-cluster mode using spark-submit

Is there any configuration new to 1.0.0 that I might be missing? I walked through all
the changes on the YARN deploy web page and updated my scripts and configuration
accordingly, and everything runs except for these two anomalies.

Thanks
Kevin Markey
