From "Udit Mehrotra (JIRA)" <>
Subject [jira] [Created] (SPARK-17512) Specifying remote files for Python based Spark jobs in Yarn cluster mode not working
Date Tue, 13 Sep 2016 00:08:20 GMT
Udit Mehrotra created SPARK-17512:

             Summary: Specifying remote files for Python based Spark jobs in Yarn cluster mode not working
                 Key: SPARK-17512
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Submit
    Affects Versions: 2.0.0
            Reporter: Udit Mehrotra

When I run a Python application in YARN cluster mode and specify a remote path for the extra
files to be added to the PYTHONPATH, using either the '--py-files' option or the
'spark.submit.pyFiles' configuration option, I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications
through spark-submit is currently only supported for local files: s3://xxxx/
at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$
at scala.collection.mutable.ArrayOps$
at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 

Here are sample commands which would throw this error in Spark 2.0 ( requires

spark-submit --deploy-mode cluster --py-files s3://xxxx/ s3://xxxx/
(works fine in 1.6)

spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3://xxxx/ s3://xxxx/
(not working in 1.6)

This works fine if the file is downloaded locally and its local path is specified.

In earlier versions of Spark this worked correctly with the '--py-files' option, but
not with the 'spark.submit.pyFiles' configuration option. Now it does not work
either way.

The following diff shows the comment which states that it should work with ‘non-local’
paths for the YARN cluster mode, and we are specifically doing separate validation to fail
if YARN client mode is used with remote paths:

And then this code (PythonRunner.formatPaths, as seen in the stack trace above) gets triggered
on every run, irrespective of whether we are using client or cluster mode, and internally
validates that the paths are local, rejecting any non-local path:
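
As a rough illustration of the kind of check described above, here is a minimal Python
sketch. It is a simplified model and not Spark's actual Scala code: the function names
(is_local_path, check_py_files) and the set of schemes treated as local are assumptions
made for this example. Only the error message text is taken from the exception above.

```python
from typing import List
from urllib.parse import urlparse

# Assumption for this sketch: a path with no scheme, or with a "file" or
# "local" scheme, counts as local. Spark's real rule may differ in detail.
LOCAL_SCHEMES = {"", "file", "local"}


def is_local_path(path: str) -> bool:
    """Return True if the path has no scheme or a local-filesystem scheme."""
    return urlparse(path).scheme in LOCAL_SCHEMES


def check_py_files(py_files: str) -> List[str]:
    """Split a comma-separated list of paths and reject any non-local entry."""
    paths = [p.strip() for p in py_files.split(",") if p.strip()]
    for p in paths:
        if not is_local_path(p):
            # Message text copied from the exception in this report.
            raise ValueError(
                "Launching Python applications through spark-submit is "
                "currently only supported for local files: " + p
            )
    return paths
```

With this model, a value such as "/tmp/deps.zip" passes, while "s3://xxxx/" raises,
which matches the failure shown in the stack trace.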

In earlier versions of Spark, this validation was not triggered when using the '--py-files'
option, because the arguments passed to '--py-files' were not stored in the
'spark.submit.pyFiles' configuration for YARN. However, the following code, newly added in 2.0,
now stores them there, so the validation is triggered even if we specify files through '--py-files'.

Also, we changed the logic in the YARN client to read values directly from the
'spark.submit.pyFiles' configuration instead of from '--py-files' (as it did earlier):

So now it is broken whether we use '--py-files' or 'spark.submit.pyFiles', because the validation
is triggered in both cases, irrespective of whether we use client or cluster mode with YARN.
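
The behaviour change described in this report can be summarised with a small toy model.
This is only a sketch of the reasoning above, not Spark code: the function names
(effective_conf, submit_fails), the version tuples, and the rule for what counts as a
remote path are all assumptions made for illustration.

```python
from urllib.parse import urlparse

PY_FILES_KEY = "spark.submit.pyFiles"


def is_remote(path):
    # Assumption: any scheme other than none, "file" or "local" is remote.
    return urlparse(path).scheme not in ("", "file", "local")


def effective_conf(spark_version, py_files_arg=None, conf=None):
    """Model which value ends up under spark.submit.pyFiles on YARN."""
    result = dict(conf or {})
    # Per the report: 2.0 stores the --py-files argument in the
    # configuration, while earlier versions did not do so for YARN.
    if py_files_arg and spark_version >= (2, 0):
        result[PY_FILES_KEY] = py_files_arg
    return result


def submit_fails(spark_version, py_files_arg=None, conf=None):
    """True if the local-only check would reject the submission in this model."""
    value = effective_conf(spark_version, py_files_arg, conf).get(PY_FILES_KEY, "")
    return any(is_remote(p) for p in value.split(",") if p)
```

In this model, submit_fails((1, 6), py_files_arg="s3://xxxx/") is False, while the same
call with (2, 0) is True, and passing the remote path through the configuration key
fails in both versions.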

