spark-user mailing list archives

From Stone Zhong <stone.zh...@gmail.com>
Subject Re: Add python library with native code
Date Sat, 06 Jun 2020 12:04:13 GMT
Great, thank you Masood, will look into it.

Regards,
Stone

On Fri, Jun 5, 2020 at 7:47 PM Masood Krohy <masood.krohy@analytical.works>
wrote:

> Not totally sure it's gonna help your use case, but I'd recommend that you
> consider these too:
>
>    - pex <https://github.com/pantsbuild/pex>: a library and tool for
>    generating .pex (Python EXecutable) files
>    - cluster-pack <https://github.com/criteo/cluster-pack>: a library on
>    top of either pex or conda-pack to make your Python code easily
>    available on a cluster.
>
> Masood
>
> __________________
>
> Masood Krohy, Ph.D.
> Data Science Advisor | Platform Architect
> https://www.analytical.works
>
> On 6/5/20 4:29 AM, Stone Zhong wrote:
>
> Thanks Dark. I looked at that article. I think it describes approach B;
> let me summarize both approach A and approach B:
> A) Put libraries in a network share, mount on each node, and in your code,
> manually set PYTHONPATH
> B) In your code, manually install the necessary package using "pip install
> -r <temp_dir>"
>
> I think approach B is very similar to approach A; both have pros and cons.
> With B), your cluster needs internet access (in my case, our cluster runs
> in an isolated environment for security reasons), although you can set up
> a private pip server and stage the needed packages there. With A), you
> need admin permission to mount the network share, which is also a DevOps
> burden.
>
> I am wondering if Spark could add a new API to handle this scenario
> instead of relying on these workarounds, which I think would be cleaner
> and more elegant.
>
> Regards,
> Stone
>
>
> On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader <relinquisheddragon@gmail.com>
> wrote:
>
>> Hi Stone,
>>
>> Have you looked into this article?
>>
>> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
>>
>>
>> I haven't tried it with .so files; however, I did use the approach he
>> recommends to install my other dependencies.
>> I hope it helps.
>>
>> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zhong@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My pyspark app depends on some Python libraries. That by itself is not
>>> a problem: I pack all the dependencies into a file libs.zip and then call
>>> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
>>>
>>> Then I encountered a problem: if any of my libraries has a binary file
>>> dependency (like .so files), this approach does not work, mainly because
>>> when you put a zip file on PYTHONPATH, Python does not load a needed
>>> binary library (e.g. a .so file) from inside the zip file. This is a
>>> Python *limitation*. So I came up with a workaround:
>>>
>>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>>> directory
>>> 2) When my Python code starts, manually call *sys.path.insert(0,
>>> f"{os.getcwd()}/libs")* to add the extracted directory to the module
>>> search path
>>>
>>> This workaround works well for me. Then I ran into another problem: what
>>> if my code running in an executor needs a Python library that has binary
>>> code? Below is an example:
>>>
>>> def do_something(p):
>>>     ...
>>>
>>> rdd = sc.parallelize([
>>>     {"x": 1, "y": 2},
>>>     {"x": 2, "y": 3},
>>>     {"x": 3, "y": 4},
>>> ])
>>> a = rdd.map(do_something)
>>>
>>> What if the function "do_something" needs a Python library that has
>>> binary code? My current solution is to extract libs.zip into an NFS share
>>> (or an SMB share) and manually call *sys.path.insert(0,
>>> f"share_mount_dir/libs")* in my "do_something" function, but adding
>>> such code to each function looks ugly. Is there a better, more elegant
>>> solution?
>>>
>>> Thanks,
>>> Stone
>>>
>>>
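For reference, steps 1) and 2) of the driver-side workaround in the original question could look roughly like the sketch below. The helper name extract_and_add_to_path is hypothetical, and the sketch assumes the archive unpacks to plain importable packages; it is not code from the thread.

```python
import os
import sys
import zipfile


def extract_and_add_to_path(zip_path, target_dir):
    """Extract zip_path into target_dir and put target_dir first on sys.path.

    Hypothetical helper: unlike importing straight from the zip, modules
    extracted to a real directory can load their .so files normally.
    """
    target_dir = os.path.abspath(target_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    # Avoid adding the same directory twice if the helper is called again.
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    return target_dir
```

A call such as extract_and_add_to_path("libs.zip", os.path.join(os.getcwd(), "libs")) would stand in for the manual sys.path.insert line, assuming libs.zip sits in the working directory.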

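On the executor-side question (repeating sys.path.insert in every mapped function), one possible way to keep that line in a single place is a small decorator. This is only a sketch: the names with_libs and LIBS_DIR are hypothetical, "share_mount_dir/libs" is the placeholder path from the original question, and the body of do_something is invented for illustration.

```python
import functools
import sys

# Placeholder path taken from the question; replace with the real mount point.
LIBS_DIR = "share_mount_dir/libs"


def with_libs(path=LIBS_DIR):
    """Return a decorator that puts `path` on sys.path before each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Runs in whichever process executes the function, so on a
            # cluster this would happen inside the Python worker.
            if path not in sys.path:
                sys.path.insert(0, path)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_libs()
def do_something(p):
    # A library with binary code would be imported here, inside the
    # function, after the path has been set up by the decorator.
    return p["x"] + p["y"]
```

With this in place, rdd.map(do_something) would stay as written in the question; the assumption is that the shared directory is mounted at the same path on every node.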