spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: PySpark on Yarn a lot of python scripts project
Date Fri, 05 Sep 2014 17:50:00 GMT
On Fri, Sep 5, 2014 at 10:21 AM, Oleg Ruchovets <oruchovets@gmail.com> wrote:
> Ok, I didn't explain myself correctly:
>    In case of Java, having a lot of classes, a jar should be used.
>    All the examples for PySpark I found are single py scripts (Pi, wordcount ...),
> but in a real environment analytics has more than one py file.
>    My question is how to use PySpark on Yarn for analytics with multiple
> python files.
>
> I am not so sure that using comma-separated python files is a good option in
> my case (we have quite a lot of files).
>   In case of using the zip option:
>      Is it just a zip of all the python files, like a jar in Java?
>      In Java there is a Manifest file which points to the main method --
> is there an equivalent?
>      Is the zip option best practice, or are there other techniques?

In daily development, it's common to modify your project and re-run
the jobs. If you use zip or egg files to package your code, you need
to re-package after every modification, which quickly becomes tedious.

If instead the code is stored on a shared file system mounted on all
the slaves, it's easy to modify and re-run your job, just as on a
local machine.
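That said, the re-packaging step itself can be automated. A minimal sketch using Python's standard `zipfile` module (the project layout, function name, and file names here are hypothetical, not from the thread):

```python
import os
import zipfile

def package_project(src_dir, zip_path):
    """Zip every .py file under src_dir, preserving relative paths,
    so the archive can be passed to spark-submit via --py-files."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to src_dir's parent so that
                    # "import myproject.module" works on the executors.
                    arcname = os.path.relpath(full, os.path.dirname(src_dir))
                    zf.write(full, arcname)

# Hypothetical usage, then submit with the archive on the PYTHONPATH:
#   package_project("myproject", "myproject.zip")
#   ./bin/spark-submit --master yarn --py-files myproject.zip main.py
```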
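As a concrete sketch of the shared-file-system approach (the mount point "/projects", the package name, and the helper function are assumptions for illustration, not part of Spark's API):

```python
import importlib
import sys

def import_from_shared(mount_point, package_name):
    """Make a package on a shared mount importable on this node.

    mount_point: e.g. "/projects", assumed to be an NFS path mounted
    identically on the driver and on every slave.
    """
    if mount_point not in sys.path:
        sys.path.append(mount_point)
    return importlib.import_module(package_name)

# Hypothetical usage inside a Spark job script:
#   myproject = import_from_shared("/projects", "myproject")
```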

> Thanks
> Oleg.
>
>
> On Sat, Sep 6, 2014 at 1:01 AM, Dimension Data, LLC.
> <subscriptions@didata.us> wrote:
>>
>> Hi:
>>
>> Curious... is there any reason not to use one of the pyspark options
>> below? Assuming each file is, say, 10k in size, are 50 files too many?
>> Does that touch on some practical limitation?
>>
>>
>> Usage: ./bin/pyspark [options]
>> Options:
>>   --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
>> or local.
>>   --deploy-mode DEPLOY_MODE   Where to run the driver program: either
>> "client" to run
>>                               on the local machine, or "cluster" to run
>> inside cluster.
>>   --class CLASS_NAME          Your application's main class (for Java /
>> Scala apps).
>>   --name NAME                 A name of your application.
>>   --jars JARS                 Comma-separated list of local jars to
>> include on the driver
>>                               and executor classpaths.
>>
>>   --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py
>> files to place
>>                               on the PYTHONPATH for Python apps.
>>
>>   --files FILES               Comma-separated list of files to be placed
>> in the working
>>                               directory of each executor.
>> [ ... snip ... ]
>>
>>
>>
>>
>>
>> On 09/05/2014 12:00 PM, Davies Liu wrote:
>> > Hi Oleg,
>> >
>> > In order to simplify packaging and distributing your code, you
>> > could deploy shared storage (such as NFS), put your project on it,
>> > and mount it on all the slaves as "/projects".
>> >
>> > In your Spark job scripts, you can then access the project by
>> > putting that path on sys.path, such as:
>> >
>> > import sys
>> > sys.path.append("/projects")
>> > import myproject
>> >
>> > Davies
>> >
>> > On Fri, Sep 5, 2014 at 1:28 AM, Oleg Ruchovets <oruchovets@gmail.com>
>> > wrote:
>> >> Hi, we are evaluating PySpark and have successfully executed
>> >> examples of PySpark on Yarn.
>> >>
>> >> Next step, what we want to do: we have a python project (a bunch
>> >> of python scripts using Anaconda packages). Question: what is the
>> >> way to execute PySpark on Yarn having a lot of python files (~50)?
>> >> Should they be packaged in an archive? What would the command to
>> >> execute PySpark on Yarn with a lot of files look like? Currently
>> >> the command looks like:
>> >>
>> >> ./bin/spark-submit --master yarn  --num-executors 3
>> >> --driver-memory 4g --executor-memory 2g --executor-cores 1
>> >> examples/src/main/python/wordcount.py   1000
>> >>
>> >> Thanks Oleg.
>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

