spark-user mailing list archives

From Stone Zhong <>
Subject Add python library with native code
Date Fri, 05 Jun 2020 07:42:14 GMT

My PySpark app depends on some Python libraries. That by itself is not a
problem: I pack all the dependencies into a single file and then call
*sc.addPyFile("")*, and this worked well for a while.

Then I ran into a problem: if any of my libraries has a binary dependency
(such as a .so file), this approach does not work. The reason is that when
an entry on PYTHONPATH is a zip file, Python does not load the binary
libraries it needs (e.g. a .so file) from inside the zip; this is a Python
*limitation*. So I came up with a workaround:

1) Do not call sc.addPyFile; instead, extract the archive into the current
working directory.
2) When my Python code starts, manually call *sys.path.insert(0,
f"{os.getcwd()}/libs")* to add the extracted libraries to the module search
path.
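
The two steps above can be sketched roughly as follows. This is only a minimal illustration: the helper name extract_and_add_to_path and the default "libs" target directory are assumptions made for the example, and it assumes the dependencies are packed as a zip archive.

```python
import os
import sys
import zipfile


def extract_and_add_to_path(archive_path, target_dir=None):
    """Extract a zip of dependencies to a real directory and put it on sys.path.

    Hypothetical helper: the name and the default target directory are
    assumptions for illustration, not part of any Spark API.
    """
    if target_dir is None:
        # Assumed layout: libraries end up under <cwd>/libs
        target_dir = os.path.join(os.getcwd(), "libs")

    # Unpack to disk so that compiled extension modules (.so files)
    # are ordinary files the regular import machinery can load.
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)

    # Put the extracted directory at the front of the module search path.
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    return target_dir
```

Calling this once at the start of the driver code, before importing any of the packed libraries, corresponds to steps 1) and 2).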

This workaround works well for me. Then I hit another problem: what if my
code running on the executors needs a Python library that has binary code?
Below is an example:

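def do_something(p):
    ...  # uses a Python library that has binary code

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)  # assumed completion; the original line is cut off after "a ="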

What if the function "do_something" needs a Python library that has
binary code? My current solution is to extract the archive onto an NFS share
(or an SMB share) and manually call *sys.path.insert(0, f"share_mount_dir/libs")*
in my "do_something" function. But adding such code to every function looks
ugly. Is there a better, more elegant solution?
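
To make the per-function pattern concrete, here is a minimal sketch in which the repeated sys.path line is factored into a decorator. The decorator name with_lib_dir is hypothetical, the body of do_something is a placeholder (the real one is not shown in this message), and it assumes the share is mounted at the same path on every executor. This is plain Python, not a Spark feature.

```python
import functools
import sys


def with_lib_dir(lib_dir):
    """Return a decorator that puts lib_dir on sys.path before each call.

    Hypothetical helper for illustration only.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # This runs in whichever process calls the function,
            # for example the Python worker on an executor.
            if lib_dir not in sys.path:
                sys.path.insert(0, lib_dir)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_lib_dir("share_mount_dir/libs")
def do_something(p):
    # Placeholder body: imports of the native-code library would go here,
    # inside the function, so they happen after the path has been set.
    return p["x"] + p["y"]
```

The decorated function can then be passed to something like rdd.map(do_something) as usual, so the path setup is written once instead of in every function.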

