Thanks Mich

To be sure: are you really saying that, using the "spark.yarn.archive" option, you have been able to OVERRIDE the installed Spark JARs with the JARs provided through "spark.yarn.archive"?

With nothing more than "spark.yarn.archive"?

Thanks

Dominique



 

On Thu, 12 Nov 2020 at 18:01, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
As I understand it, Spark expects the jar files to be available on all nodes or, if applicable, in an HDFS directory.

Putting Spark Jar files on HDFS

In YARN mode, it is important that the Spark jar files are available throughout the Spark cluster. I have spent a fair bit of time on this, and I recommend that you follow this procedure to make sure the spark-submit job runs OK. Use the spark.yarn.archive configuration option and set it to the location of an archive (which you create on HDFS) containing all the JARs from the $SPARK_HOME/jars/ folder, at the root level of the archive. For example:

1) Create the archive:
   jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
2) Create a directory on HDFS for the jars, accessible to the application:
   hdfs dfs -mkdir /jars
3) Upload the archive to HDFS:
   hdfs dfs -put spark-libs.jar /jars
4) For a large cluster, increase the replication count of the Spark archive
   so that you reduce the number of times a NodeManager has to do a remote copy:
   hdfs dfs -setrep -w 10 hdfs:///jars/spark-libs.jar
   (change the number of replicas in proportion to the total number of NodeManagers)
5) In the $SPARK_HOME/conf/spark-defaults.conf file, set spark.yarn.archive to
   hdfs://rhes75:9000/jars/spark-libs.jar, similar to the below:
   spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar

Every node of Spark needs to have the same $SPARK_HOME/conf/spark-defaults.conf file
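
If editing spark-defaults.conf on every node is not practical, the same property can also be passed per job with --conf on spark-submit. A minimal sketch, reusing the archive location from the steps above (adjust the host, port and class name to your setup):

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar \
  --class MyClass main-application.jar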

HTH



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Thu, 12 Nov 2020 at 16:35, Russell Spitzer <russell.spitzer@gmail.com> wrote:
--driver-class-path does not move jars, so it is dependent on your Spark resource manager (master). It is interpreted literally, so if your files do not exist at the location you provide, relative to where the driver is run, they will not be placed on the classpath.

Since the driver is responsible for moving the jars specified in --jars, you cannot use a jar specified by --jars in --driver-class-path: the driver is already started, and its classpath is already set, before any jars are moved.

Some distributions may change this behavior, but this is the gist of it.
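
As a rough sketch of how this typically plays out on YARN (reusing the jar paths from your example; exact behavior can vary by distribution and deploy mode):

# Client mode: the driver runs on the submitting machine, so the driver
# classpath must use paths that already exist there (the same local paths
# given to --jars are the safest choice).
spark-submit --master yarn --deploy-mode client \
  --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --driver-class-path /a/b/some1.jar:/a/b/c/some2.jar \
  --class MyClass main-application.jar

# Cluster mode: the driver itself runs inside a YARN container, where the
# --jars files are typically localized into the container's working
# directory, so bare file names via spark.driver.extraClassPath may resolve there.
spark-submit --master yarn --deploy-mode cluster \
  --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --conf spark.driver.extraClassPath=some1.jar:some2.jar \
  --class MyClass main-application.jar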

On Thu, Nov 12, 2020 at 10:02 AM Dominique De Vito <ddv36a78@gmail.com> wrote:
Hi,

I am using Spark 2.1 (BTW) on YARN.

I am trying to upload JARs to the YARN cluster, and to use them to replace the on-site (already in place) JARs.

I am trying to do so through spark-submit.

One helpful answer, https://stackoverflow.com/questions/37132559/add-jars-to-a-spark-job-spark-submit/37348234, is the following:

spark-submit --jars additional1.jar,additional2.jar \
  --driver-class-path additional1.jar:additional2.jar \
  --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar

So, I understand the following:

  • "--jars" is for uploading jar on each node
  • "--driver-class-path" is for using uploaded jar for the driver.
  • "--conf spark.executor.extraClassPath" is for using uploaded jar for executors.

While I know the file paths to give to "--jars" within a spark-submit command, what will be the file paths of the uploaded JARs, to be used in "--driver-class-path" for example?

The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"

Fine, but for the following command, what should I put instead of XXX and YYY?

spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --driver-class-path XXX:YYY \
  --conf spark.executor.extraClassPath=XXX:YYY \
  --class MyClass main-application.jar

When using spark-submit, how can I reference the "working directory for the SparkContext" to form the XXX and YYY file paths?

Thanks.

Dominique

PS: I have tried

spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --driver-class-path some1.jar:some2.jar \
  --conf spark.executor.extraClassPath=some1.jar:some2.jar  \
  --class MyClass main-application.jar

No success (if I made no mistake)

And I have also tried:

spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
   --driver-class-path ./some1.jar:./some2.jar \
   --conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
   --class MyClass main-application.jar

No success either.