spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly
Date Tue, 29 Dec 2015 12:03:49 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073840#comment-15073840 ]

Sean Owen commented on SPARK-10789:
-----------------------------------

I don't think we want to add another config flag here. It sounds like you want an assembly
built for your version of "Hadoop", one that includes access to custom file systems. The
general practice here has been to do just that, so you could then include the S3 FS libs as
desired. It's probably good practice to make your own build anyway if it needs to harmonize
with EMR's version of Hadoop.

(BTW you could update the title to reference the problem more directly: spark-submit in cluster
mode can't use third-party libraries or something. Including the assembly isn't a problem.)

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---------------------------------------------------------------
>
>                 Key: SPARK-10789
>                 URL: https://issues.apache.org/jira/browse/SPARK-10789
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Jonathan Kelly
>         Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that gets launched
> only includes the Spark assembly and not spark.driver.extraClassPath. This is of course by
> design, since the driver actually runs on the cluster and not inside the SparkSubmit process.
>
> However, if the SparkSubmit process, minimal as it may be, needs any extra libraries
> that are not part of the Spark assembly, there is no good way to include them. (I say "no
> good way" because including them in the SPARK_CLASSPATH environment variable does cause the
> SparkSubmit process to include them, but this is not acceptable because this environment variable
> has long been deprecated, and it prevents the use of spark.driver.extraClassPath.)
>
> An example of when this matters is on Amazon EMR when using an S3 path for the application
> JAR and running in yarn-cluster mode. The SparkSubmit process needs the EmrFileSystem implementation
> and its dependencies in the classpath in order to download the application JAR from S3, so
> it fails with a ClassNotFoundException. (EMR currently gets around this by setting SPARK_CLASSPATH,
> but as mentioned above this is less than ideal.)
>
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra classpath
> whether it's client mode or cluster mode, and this seems to work, but I don't know if there
> is any downside to this.
>
> Example that fails on emr-4.0.0 (if you switch to setting spark.(driver,executor).extraClassPath
> instead of SPARK_CLASSPATH):
>
>     spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt
>
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
> 	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
> 	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
> 	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
> 	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
> 	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
> 	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
> 	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
> 	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> 	at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
> 	at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
> 	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
> 	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
> 	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
> 	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> 	at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
> 	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
> 	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
> 	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
> 	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
> 	... 27 more
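
The behavior described in the quoted report, and the change the reporter says he tried, can be illustrated with a small sketch. This is not Spark's actual SparkSubmitCommandBuilder code: the class name, the method, the `includeExtraInClusterMode` switch, the entry ordering, and the example paths are all assumptions made for illustration. It only shows the decision being discussed, namely whether the entries from spark.driver.extraClassPath reach the launcher classpath when the deploy mode is cluster.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LauncherClasspathSketch {

    /**
     * Hypothetical helper, NOT Spark's real implementation.
     * Builds the classpath entries for the launcher (SparkSubmit) JVM.
     *
     * @param assemblyJar               the Spark assembly jar (always included)
     * @param driverExtraClassPath      value of spark.driver.extraClassPath, may be null
     * @param clientMode                true for client deploy mode, false for cluster
     * @param includeExtraInClusterMode illustrative switch modelling the proposed change
     */
    static List<String> launcherClasspath(String assemblyJar,
                                          String driverExtraClassPath,
                                          boolean clientMode,
                                          boolean includeExtraInClusterMode) {
        List<String> entries = new ArrayList<>();
        // Reported behavior: extra entries are only used in client mode.
        // Proposed change: use them in cluster mode as well.
        boolean useExtra = clientMode || includeExtraInClusterMode;
        if (useExtra && driverExtraClassPath != null) {
            for (String entry : driverExtraClassPath.split(File.pathSeparator)) {
                if (!entry.isEmpty()) {
                    entries.add(entry);
                }
            }
        }
        // The assembly is always on the launcher classpath.
        entries.add(assemblyJar);
        return entries;
    }

    public static void main(String[] args) {
        String assembly = "spark-assembly.jar";   // placeholder name
        String extra = "/opt/custom-fs/lib/*";    // hypothetical location of file system libs

        // Cluster mode as reported: only the assembly is present.
        System.out.println(launcherClasspath(assembly, extra, false, false));
        // Cluster mode with the proposed change: the extra entry is present too.
        System.out.println(launcherClasspath(assembly, extra, false, true));
    }
}
```

In the first call the custom file system libraries are absent from the launcher classpath, which corresponds to the ClassNotFoundException in the trace above. Whether including them in cluster mode has downsides is exactly the open question in the report.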



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

