spark-user mailing list archives

From Sergey Zhemzhitsky <>
Subject Best way of shipping self-contained pyspark jobs with 3rd-party dependencies
Date Fri, 08 Dec 2017 13:00:19 GMT
Hi PySparkers,

What currently is the best way of shipping self-contained pyspark jobs
with 3rd-party dependencies?
There are some open JIRA issues [1], [2], corresponding PRs
[3], [4], and articles [5], [6], [7] about setting up the Python
environment with conda and virtualenv respectively. I believe [7]
is a misleading article, because it relies on unsupported Spark options,
like spark.pyspark.virtualenv.requirements, etc.

So I'm wondering what the community does in cases when it's necessary to
- prevent Python package/module version conflicts between different jobs
- avoid updating every node of the cluster whenever a job gains a new dependency
- track which dependencies are introduced on a per-job basis

