spark-user mailing list archives

From Andrew Lee <>
Subject RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Date Wed, 18 Jun 2014 18:24:36 GMT
Forgot to mention that I am using spark-submit to submit jobs. The verbose-mode printout
for the SparkPi example looks like the output below. The .sparkStaging directory is not deleted.
My thought is that it is part of the staging and should be cleaned up as well when sc is terminated.

[test@ spark]$ SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false" \
    SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar \
    ./bin/spark-submit --verbose --master yarn --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --driver-memory 512M \
    --driver-library-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar \
    --executor-memory 512M --executor-cores 1 --queue research --num-executors 2 \
    examples/target/spark-examples_2.10-1.0.0.jar

Using properties file: null
Using properties file: null
Parsed arguments:
  master                  yarn
  deployMode              cluster
  executorMemory          512M
  executorCores           1
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            512M
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
  driverExtraJavaOptions  null
  supervise               false
  queue                   research
  numExecutors            2
  files                   null
  pyFiles                 null
  archives                null
  mainClass               org.apache.spark.examples.SparkPi
  primaryResource         file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
  name                    org.apache.spark.examples.SparkPi
  childArgs               []
  jars                    null
  verbose                 true

Default properties from null:

Using properties file: null
Main class:
System properties:
spark.driver.extraLibraryPath -> /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
SPARK_SUBMIT -> true -> org.apache.spark.examples.SparkPi
Classpath elements:

Subject: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Date: Wed, 18 Jun 2014 11:05:12 -0700

Hi All,
Has anyone run into the same problem? Looking at the source code in the official release (rc11),
this property is set to false by default. However, I'm seeing the .sparkStaging folder
remain on HDFS, and it fills up the disk pretty fast, since SparkContext deploys
the fat JAR file (~115MB) for every job and it is not cleaned up.

yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
      val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean

[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx------   - test users          0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx------   - test users          0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx------   - test users          0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx------   - test users          0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx------   - test users          0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx------   - test users          0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx------   - test users          0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx------   - test users          0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx------   - test users          0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx------   - test users          0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx------   - test users          0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
….
drwx------   - test users          0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx------   - test users          0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx------   - test users          0 2014-06-16 15:16 .sparkStaging/application_1402001910637_0136
drwx------   - test users          0 2014-06-16 15:46 .sparkStaging/application_1402001910637_0138
drwx------   - test users          0 2014-06-16 23:57 .sparkStaging/application_1402001910637_0157
drwx------   - test users          0 2014-06-17 05:55 .sparkStaging/application_1402001910637_0161

Is this something that needs to be set explicitly, as in SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false"?
(The property is set to true to preserve the staged files (Spark jar, app jar, distributed cache
files) at the end of the job rather than delete them.) Or is this a bug where the default value
is not honored and is overridden to true somewhere?
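
Until the cause is known, leftover entries like the ones in the listing above could be picked out with a small shell helper. This is only a sketch: the function name stale_staging_dirs and the cutoff-date argument are made up for illustration, and it assumes the listing has the same eight-column layout as the `hdfs dfs -ls` output shown in this message.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark or Hadoop): reads the output of
# `hdfs dfs -ls .sparkStaging` on stdin and prints the paths of directories
# whose modification date is earlier than the cutoff (format YYYY-MM-DD).
stale_staging_dirs() {
  cutoff="$1"
  # Directory lines start with "d"; field 6 is the date and field 8 the path.
  # ISO dates compare correctly as plain strings.
  awk -v cutoff="$cutoff" '/^d/ && $6 < cutoff { print $8 }'
}
```

For example, `hdfs dfs -ls .sparkStaging | stale_staging_dirs 2014-06-01` would print only the May entries from the listing above. The result should be reviewed by hand before any of those directories are removed.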
