spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit
Date Thu, 12 Nov 2020 17:01:23 GMT
As I understand it, Spark expects the jar files to be available on all
nodes, or, where applicable, in an HDFS directory.

Putting Spark Jar files on HDFS

In YARN mode, *it is important that the Spark jar files are available
throughout the Spark cluster*. I have spent a fair bit of time on this, and
I recommend that you follow this procedure to make sure that the
spark-submit job runs OK. Use the spark.yarn.archive configuration option
and set it to the location of an archive (which you create on HDFS)
containing all the JARs from the $SPARK_HOME/jars/ folder, at the root
level of the archive. For example:

1) Create the archive:
   jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
2) Create a directory on HDFS for the jars, accessible to the application:
   hdfs dfs -mkdir /jars
3) Upload the archive to HDFS:
   hdfs dfs -put spark-libs.jar /jars
4) For a large cluster, increase the replication count of the Spark archive
   so that you reduce the number of times a NodeManager has to do a remote
   copy:
   hdfs dfs -setrep -w 10 hdfs:///jars/spark-libs.jar
   (change the number of replicas in proportion to the total number of
   NodeManagers)
5) In the $SPARK_HOME/conf/spark-defaults.conf file, set spark.yarn.archive
   to the archive location on HDFS, similar to below:
   spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar
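
As a side note, the same setting can also be passed per job on the
spark-submit command line instead of, or on top of, the defaults file. A
minimal sketch, reusing the host name and port above and the placeholder
class and jar names from the question further down in this thread:

   spark-submit --master yarn --deploy-mode cluster \
     --conf spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar \
     --class MyClass main-application.jar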


Every Spark node needs to have the same
$SPARK_HOME/conf/spark-defaults.conf file.
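
If it helps, a minimal sketch of one way to push that file out, assuming
passwordless ssh and a plain nodes.txt file listing the host names (both of
these are my own assumptions, not part of any Spark tooling):

   # assumes $SPARK_HOME is the same path on every node
   for host in $(cat nodes.txt); do
     scp $SPARK_HOME/conf/spark-defaults.conf ${host}:$SPARK_HOME/conf/
   done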

HTH



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 12 Nov 2020 at 16:35, Russell Spitzer <russell.spitzer@gmail.com>
wrote:

> --driver-class-path does not move jars, so its behavior depends on your
> Spark resource manager (master). It is interpreted literally, so if your
> files do not exist at the locations you provide, relative to where the
> driver is run, they will not be placed on the classpath.
>
> Since the driver is responsible for moving the jars specified in --jars,
> you cannot rely on a jar specified by --jars being on the
> driver-class-path: the driver has already started and its classpath is
> already set before any jars are moved.
>
> Some distributions may change this behavior, but this is the gist of it.
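>
> For illustration, a minimal sketch (client mode, reusing the absolute
> paths and placeholder names from the question below; those paths must
> exist on the machine that starts the driver):
>
>    # driver classpath entries are plain local paths, resolved where the
>    # driver process runs
>    spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
>      --driver-class-path /a/b/some1.jar:/a/b/c/some2.jar \
>      --class MyClass main-application.jar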
>
> On Thu, Nov 12, 2020 at 10:02 AM Dominique De Vito <ddv36a78@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am using Spark 2.1 (BTW) on YARN.
>>
>> I am trying to upload JARs to the YARN cluster, and to use them to
>> replace the on-site (already in-place) JARs.
>>
>> I am trying to do so through spark-submit.
>>
>> One helpful answer
>> https://stackoverflow.com/questions/37132559/add-jars-to-a-spark-job-spark-submit/37348234
>> is the following one:
>>
>> spark-submit --jars additional1.jar,additional2.jar \
>>   --driver-class-path additional1.jar:additional2.jar \
>>   --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
>>   --class MyClass main-application.jar
>>
>> So, I understand the following:
>>
>>    - "--jars" is for uploading jar on each node
>>    - "--driver-class-path" is for using uploaded jar for the driver.
>>    - "--conf spark.executor.extraClassPath" is for using uploaded jar
>>    for executors.
>>
>> While I control the file paths given to "--jars" within a spark-submit
>> command, what will be the file paths of the uploaded JARs, to be used in
>> "--driver-class-path" for example?
>>
>> The doc says: "*JARs and files are copied to the working directory for
>> each SparkContext on the executor nodes*"
>>
>> Fine, but for the following command, what should I put instead of XXX and
>> YYY?
>>
>> spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
>>   --driver-class-path XXX:YYY \
>>   --conf spark.executor.extraClassPath=XXX:YYY \
>>   --class MyClass main-application.jar
>>
>> When using spark-submit, how can I reference the "*working directory for
>> the SparkContext*" to form the XXX and YYY file paths?
>>
>> Thanks.
>>
>> Dominique
>>
>> PS: I have tried
>>
>> spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
>>   --driver-class-path some1.jar:some2.jar \
>>   --conf spark.executor.extraClassPath=some1.jar:some2.jar  \
>>   --class MyClass main-application.jar
>>
>> No success (unless I made a mistake).
>>
>> And I have tried also:
>>
>> spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
>>    --driver-class-path ./some1.jar:./some2.jar \
>>    --conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
>>    --class MyClass main-application.jar
>>
>> No success either.
>>
>
