spark-user mailing list archives

From Dominique De Vito <ddv36...@gmail.com>
Subject Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit
Date Thu, 12 Nov 2020 22:24:14 GMT
Thanks Mich

To be sure, are you really saying that, using the option
"spark.yarn.archive", YOU have been able to OVERRIDE the installed Spark JARs
with the JARs given in the "spark.yarn.archive" option?

Nothing more than "spark.yarn.archive"?

Thanks

Dominique





On Thu, 12 Nov 2020 at 18:01, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> As I understand it, Spark expects the jar files to be available on all
> nodes or, if applicable, in an HDFS directory.
>
> Putting Spark Jar files on HDFS
>
> In Yarn mode, *it is important that Spark jar files are available
> throughout the Spark cluster*. I have spent a fair bit of time on this
> and I recommend that you follow this procedure to make sure that the
> spark-submit job runs ok. Use the spark.yarn.archive configuration option
> and set that to the location of an archive (you create on HDFS) containing
> all the JARs in the $SPARK_HOME/jars/ folder, at the root level of the
> archive. For example:
>
> 1) Create the archive:
>    jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
> 2) Create a directory on HDFS for the jars accessible to the application:
>    hdfs dfs -mkdir /jars
> 3) Upload to HDFS:
>    hdfs dfs -put spark-libs.jar /jars
> 4) For a large cluster, increase the replication count of the Spark archive
>    so that you reduce the number of times a NodeManager will do a remote copy
>    (change the number of replicas in proportion to the total number of
>    NodeManagers):
>    hdfs dfs -setrep -w 10 hdfs:///jars/spark-libs.jar
> 5) In the $SPARK_HOME/conf/spark-defaults.conf file, set spark.yarn.archive
>    to hdfs://rhes75:9000/jars/spark-libs.jar, similar to below:
>    spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar
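
The procedure above could be collected into one small script. This is only a sketch under stated assumptions, not something from the original message: the variable names, the `run` helper, the `/opt/spark` fallback and the dry-run default are all illustrative, and `-mkdir -p` is used instead of plain `-mkdir` so that re-running does not fail on an existing directory. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: wraps the archive/upload/replication steps in one function.
# All names below are illustrative assumptions, not part of Spark or Hadoop.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"   # assumed fallback install path
HDFS_DIR="${HDFS_DIR:-/jars}"            # HDFS directory for the archive
ARCHIVE="${ARCHIVE:-spark-libs.jar}"     # archive file name
REPLICAS="${REPLICAS:-10}"               # replication count for the archive
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

publish_spark_archive() {
    # Uncompressed archive with all Spark jars at its root level.
    run jar cv0f "$ARCHIVE" -C "$SPARK_HOME/jars/" .
    # Create the target directory (-p: no error if it already exists).
    run hdfs dfs -mkdir -p "$HDFS_DIR"
    # Upload the archive to HDFS.
    run hdfs dfs -put "$ARCHIVE" "$HDFS_DIR"
    # Raise the replication count and wait for it to complete.
    run hdfs dfs -setrep -w "$REPLICAS" "$HDFS_DIR/$ARCHIVE"
}

publish_spark_archive
```

Running it with `DRY_RUN=0` would execute the commands for real; the spark.yarn.archive entry in spark-defaults.conf still has to be set separately.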
>
>
> Every node of Spark needs to have the
> same $SPARK_HOME/conf/spark-defaults.conf file
>
> HTH
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 12 Nov 2020 at 16:35, Russell Spitzer <russell.spitzer@gmail.com>
> wrote:
>
>> --driver-class-path does not move jars, so it is dependent on your Spark
>> resource manager (master). It is interpreted literally so if your files do
>> not exist in the location you provide relative to where the driver is run,
>> they will not be placed on the classpath.
>>
>> Since the driver is responsible for moving jars specified in --jars, you
>> cannot use a jar specified by --jars to be in driver-class-path, since the
>> driver is already started and its classpath is already set before any jars
>> are moved.
>>
>> Some distributions may change this behavior though, but this is the gist
>> of it.
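
Because the classpath entries are taken literally, one way to catch a wrong path early might be to check each entry before calling spark-submit. The helper below is a hypothetical sketch (`check_classpath` is not a Spark tool). It only inspects the machine it runs on, so it matches the driver's view only when the driver is started on that same machine, and it does not expand wildcard entries such as `dir/*`.

```shell
#!/bin/sh
# Hypothetical helper: report classpath entries that do not exist locally.
# Usage: check_classpath "entry1:entry2:..."
# Returns 0 if every entry exists, 1 otherwise.
check_classpath() {
    status=0
    # Split the argument on ':' and read one entry per line.
    while IFS= read -r entry; do
        # Skip empty entries (for example from a trailing ':').
        [ -n "$entry" ] || continue
        if [ ! -e "$entry" ]; then
            echo "missing: $entry"
            status=1
        fi
    done <<EOF
$(printf '%s\n' "$1" | tr ':' '\n')
EOF
    return "$status"
}

# Example with illustrative paths:
#   check_classpath "/a/b/some1.jar:/a/b/c/some2.jar" \
#     && echo "all driver classpath entries exist on this machine"
```

A check like this says nothing about the executor nodes; those paths would have to be verified on each node separately.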
>>
>> On Thu, Nov 12, 2020 at 10:02 AM Dominique De Vito <ddv36a78@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 2.1 (BTW) on YARN.
>>>
>>> I am trying to upload JARs to a YARN cluster and to use them to replace
>>> the on-site (already in place) JARs.
>>>
>>> I am trying to do so through spark-submit.
>>>
>>> One helpful answer
>>> https://stackoverflow.com/questions/37132559/add-jars-to-a-spark-job-spark-submit/37348234
>>> is the following one:
>>>
>>> spark-submit --jars additional1.jar,additional2.jar \
>>>   --driver-class-path additional1.jar:additional2.jar \
>>>   --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
>>>   --class MyClass main-application.jar
>>>
>>> So, I understand the following:
>>>
>>>    - "--jars" is for uploading the jars to each node.
>>>    - "--driver-class-path" is for using the uploaded jars on the driver.
>>>    - "--conf spark.executor.extraClassPath" is for using the uploaded
>>>    jars on the executors.
>>>
>>> While I control the file paths for "--jars" within a spark-submit
>>> command, what will be the file paths of the uploaded JARs to be used in
>>> "--driver-class-path", for example?
>>>
>>> The doc says: "*JARs and files are copied to the working directory for
>>> each SparkContext on the executor nodes*"
>>>
>>> Fine, but for the following command, what should I put instead of XXX
>>> and YYY ?
>>>
>>> spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
>>>   --driver-class-path XXX:YYY \
>>>   --conf spark.executor.extraClassPath=XXX:YYY \
>>>   --class MyClass main-application.jar
>>>
>>> When using spark-submit, how can I reference the "*working directory
>>> for the SparkContext*" to form XXX and YYY filepath ?
>>>
>>> Thanks.
>>>
>>> Dominique
>>>
>>> PS: I have tried
>>>
>>> spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
>>>   --driver-class-path some1.jar:some2.jar \
>>>   --conf spark.executor.extraClassPath=some1.jar:some2.jar  \
>>>   --class MyClass main-application.jar
>>>
>>> No success (if I made no mistake)
>>>
>>> And I have tried also:
>>>
>>> spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
>>>    --driver-class-path ./some1.jar:./some2.jar \
>>>    --conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
>>>    --class MyClass main-application.jar
>>>
>>> No success either.
>>>
>>
