spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick McCarthy <pmccar...@dstillery.com.INVALID>
Subject Re: Add python library
Date Mon, 08 Jun 2020 14:01:42 GMT
I've found Anaconda encapsulates modules and dependencies and such nicely,
and you can deploy all the needed .so files and such by deploying a whole
conda environment.

I've used this method with success:
https://community.cloudera.com/t5/Community-Articles/Running-PySpark-with-Conda-Env/ta-p/247551

On Sat, Jun 6, 2020 at 4:16 PM Anwar AliKhan <anwaralikhanuae@gmail.com>
wrote:

>  " > Have you looked into this article?
> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
>  "
>
> This is weird !
> I was hanging out here https://machinelearningmastery.com/start-here/.
> When I came across this post.
>
> The weird part is I was just wondering  how I can take one of the
> projects(Open AI GYM taxi-vt2 in Python), a project I want to develop
> further.
>
> I want to run on Spark using Spark's parallelism features and GPU
> capabilities,  when I am using bigger datasets . While installing the
> workers (slaves)  doing the sliced dataset computations on the new 8GB RAM
> Raspberry Pi (Linux).
>
> Are any other documents on official website which shows how to do that,
> or any other location  , preferably showing full self contained examples?
>
>
>
> On Fri, 5 Jun 2020, 09:02 Dark Crusader, <relinquisheddragon@gmail.com>
> wrote:
>
>> Hi Stone,
>>
>>
>> I haven't tried it with .so files however I did use the approach he
>> recommends to install my other dependencies.
>> I Hope it helps.
>>
>> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zhong@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> So my pyspark app depends on some python libraries, it is not a problem,
>>> I pack all the dependencies into a file libs.zip, and then call
>>> *sc.addPyFile("libs.zip")* and it works pretty well for a while.
>>>
>>> Then I encountered a problem, if any of my library has any binary file
>>> dependency (like .so files), this approach does not work. Mainly because
>>> when you set PYTHONPATH to a zip file, python does not look up needed
>>> binary library (e.g. a .so file) inside the zip file, this is a python
>>> *limitation*. So I got a workaround:
>>>
>>> 1) Do not call sc.addPyFile, instead extract the libs.zip into current
>>> directory
>>> 2) When my python code starts, manually call *sys.path.insert(0,
>>> f"{os.getcwd()}/libs")* to set PYTHONPATH
>>>
>>> This workaround works well for me. Then I got another problem: what if
>>> my code in executor need python library that has binary code? Below is am
>>> example:
>>>
>>> def do_something(p):
>>>     ...
>>>
>>> rdd = sc.parallelize([
>>>     {"x": 1, "y": 2},
>>>     {"x": 2, "y": 3},
>>>     {"x": 3, "y": 4},
>>> ])
>>> a = rdd.map(do_something)
>>>
>>> What if the function "do_something" need a python library that has
>>> binary code? My current solution is, extract libs.zip into a NFS share (or
>>> a SMB share) and manually do *sys.path.insert(0,
>>> f"share_mount_dir/libs") *in my "do_something" function, but adding
>>> such code in each function looks ugly, is there any better/elegant solution?
>>>
>>> Thanks,
>>> Stone
>>>
>>>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016

Mime
View raw message