spark-user mailing list archives

From Anwar AliKhan <anwaralikhan...@gmail.com>
Subject Add python library
Date Sat, 06 Jun 2020 20:16:07 GMT
> Have you looked into this article?
> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987

This is weird! I was hanging out at
https://machinelearningmastery.com/start-here/ when I came across this post.

The weird part is that I was just wondering how I could take one of the
projects (Open AI GYM taxi-vt2 in Python), a project I want to develop
further, and run it on Spark.

I want to use Spark's parallelism features and GPU capabilities when I am
working with bigger datasets, with the workers (slaves) that do the sliced
dataset computations installed on the new 8GB RAM Raspberry Pi (Linux).

Are there any other documents on the official website, or anywhere else,
that show how to do that, preferably with full self-contained examples?



On Fri, 5 Jun 2020, 09:02 Dark Crusader, <relinquisheddragon@gmail.com>
wrote:

> Hi Stone,
>
>
> I haven't tried it with .so files; however, I did use the approach he
> recommends to install my other dependencies.
> I hope it helps.
>
> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zhong@gmail.com> wrote:
>
>> Hi,
>>
>> My pyspark app depends on some python libraries. That is not a problem:
>> I pack all the dependencies into a file libs.zip and then call
>> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
>>
>> Then I encountered a problem: if any of my libraries has a binary file
>> dependency (like .so files), this approach does not work. This is mainly
>> because when you set PYTHONPATH to a zip file, python does not look up the
>> needed binary library (e.g. a .so file) inside the zip file; this is a
>> python *limitation*. So I came up with a workaround:
>>
>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>> directory
>> 2) When my python code starts, manually call *sys.path.insert(0,
>> f"{os.getcwd()}/libs")* to add the extracted directory to the module
>> search path
>>
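
A minimal sketch of the two steps above, assuming libs.zip holds the
packages at its top level so that they land directly in a libs directory.
The helper name extract_and_register is hypothetical; it is not from the
original post or from Spark.

```python
import os
import sys
import zipfile


def extract_and_register(zip_path, target_dir=None):
    """Extract zip_path to a real directory and put it first on sys.path.

    Extension modules (.so files) can be imported from a directory on
    disk, which is not possible while they sit inside a zip archive.
    """
    if target_dir is None:
        # Assumption: mirror the post and use ./libs under the working dir.
        target_dir = os.path.join(os.getcwd(), "libs")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    # Avoid adding the same entry twice if the helper is called again.
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    return target_dir
```

Calling extract_and_register("libs.zip") once at driver start-up would
then replace both the manual extraction and the sys.path.insert call.
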
>> This workaround works well for me. Then I ran into another problem: what
>> if my code in the executor needs a python library that has binary code?
>> Below is an example:
>>
>> def do_something(p):
>>     ...
>>
>> rdd = sc.parallelize([
>>     {"x": 1, "y": 2},
>>     {"x": 2, "y": 3},
>>     {"x": 3, "y": 4},
>> ])
>> a = rdd.map(do_something)
>>
>> What if the function "do_something" needs a python library that has
>> binary code? My current solution is to extract libs.zip into an NFS share
>> (or an SMB share) and manually call *sys.path.insert(0,
>> f"share_mount_dir/libs")* in my "do_something" function, but adding such
>> code to each function looks ugly. Is there a better, more elegant solution?
>>
>> Thanks,
>> Stone
>>
>>

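One possible way to avoid repeating the sys.path.insert call in every
function is a small decorator. This is only a sketch in plain Python: the
names with_lib_path and sum_xy are hypothetical, "share_mount_dir/libs" is
the placeholder path from the quoted message, and it has not been tested
on a cluster.

```python
import functools
import sys


def with_lib_path(lib_dir):
    """Return a decorator that puts lib_dir on sys.path before each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Runs in whichever process calls the function (for rdd.map,
            # that would be the executor's Python worker).
            if lib_dir not in sys.path:
                sys.path.insert(0, lib_dir)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_lib_path("share_mount_dir/libs")  # placeholder path from the post
def sum_xy(p):
    # Hypothetical body. Libraries with binary code would be imported
    # here, inside the function, so the import happens after the path
    # has been set.
    return p["x"] + p["y"]
```

With this in place, rdd.map(sum_xy) would be written the same way as
rdd.map(do_something) in the quoted example, and the path handling lives
in one spot.
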