spark-user mailing list archives

From Patrick McCarthy <pmccar...@dstillery.com.INVALID>
Subject Re: script running in jupyter 6-7x faster than spark submit
Date Wed, 11 Sep 2019 13:36:23 GMT
Are you running in cluster mode? A large virtualenv zip for the driver sent
into the cluster on a slow pipe could account for much of that eight
minutes.
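
On the question quoted below about whether dependent libraries travel to the workers: as far as I know, Spark does not copy the driver's installed packages on its own. Executors run whatever interpreter PYSPARK_PYTHON (or spark.pyspark.python) resolves to on each node, plus anything shipped explicitly, for example with --py-files or --archives. A minimal sketch for checking what each side is really running, assuming a live SparkContext is passed in as `sc`; the helper names python_fingerprint and executor_fingerprints are made up for illustration and are not part of Spark:

```python
import platform
import sys
import zlib


def python_fingerprint(_=None):
    """Describe the interpreter this function runs in (hypothetical helper).

    The unused argument lets the function be passed straight to rdd.map().
    """
    return (
        platform.node(),            # host name
        sys.executable,             # path of the running interpreter
        platform.python_version(),  # e.g. "2.7.5"
        zlib.ZLIB_VERSION,          # zlib version the zlib module was built against
    )


def executor_fingerprints(sc, num_tasks=100):
    """Collect the distinct fingerprints seen by executor tasks.

    Assumes `sc` is a live pyspark SparkContext. num_tasks is an arbitrary
    choice; more tasks make it more likely that every executor is sampled.
    """
    return (
        sc.range(num_tasks, numSlices=num_tasks)
        .map(python_fingerprint)
        .distinct()
        .collect()
    )


if __name__ == "__main__":
    # Driver side only; call executor_fingerprints(sc) from a Spark session.
    print("driver: %s" % (python_fingerprint(),))
```

Running this once from the notebook and once from the spark-submit script, then comparing the driver tuple with the executor tuples, would show whether the two launch paths end up on different interpreters.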

On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati <dhruba.work@gmail.com>
wrote:

> Hi,
>
> I just ran the same script in a shell in the jupyter notebook and found the
> performance to be similar. So I can confirm this is happening because the
> libraries used by the jupyter notebook python are different from those used
> by the spark-submit python.
>
> But now I have a follow-up question: are the libraries a python script
> depends on also transferred to the worker machines when the script is
> executed in spark? Even though the driver python versions are
> different, the worker machines will use their own, identical python
> environment to run the code. If anyone can explain this part, it would be helpful.
>
>
>
>
> *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 9:45 AM Dhrubajyoti Hati <dhruba.work@gmail.com>
> wrote:
>
>> I just checked the machine the script is submitted from, i.e. the driver:
>> the python envs are different. The Jupyter one is running within a virtual
>> environment with Python 2.7.5, and the spark-submit one uses 2.6.6. But
>> the executors have the same python version, right? I tried doing a
>> spark-submit from the jupyter shell; it fails to find python 2.7, which is
>> not there, and hence throws an error.
>>
>> Here is the udf which might take time:
>>
>> import base64
>> import zlib
>>
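>> def decompress(data):
>>     # Decode the base64 text back into the compressed bytes.
>>     bytecode = base64.b64decode(data)
>>     # 32 + zlib.MAX_WBITS lets zlib auto-detect a gzip or zlib header.
>>     d = zlib.decompressobj(32 + zlib.MAX_WBITS)
>>     decompressed_data = d.decompress(bytecode)
>>     return decompressed_data.decode('utf-8')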
>>
>>
>> Could this be because of the mismatch between the two python environments on
>> the driver side? But the processing happens on the executor side?
>>
>>
>>
>>
>> *Regards,Dhrub*
>>
>> On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari <abdealikothari@gmail.com>
>> wrote:
>>
>>> Maybe you can try running it in a python shell or
>>> jupyter-console/ipython instead of a spark-submit and check how much time
>>> it takes too.
>>>
>>> Compare the env variables to check that no additional env configuration
>>> is present in either environment.
>>>
>>> Also is the python environment for both the exact same? I ask because it
>>> looks like you're using a UDF and if the Jupyter python has (let's say)
>>> numpy compiled with blas it would be faster than a numpy without it. Etc.
>>> I.E. Some library you use may be using pure python and another may be using
>>> a faster C extension...
>>>
>>> What python libraries are you using in the UDFs? If you don't use UDFs
>>> at all and use some very simple pure spark functions, does the time
>>> difference still exist?
>>>
>>> Also, are you using dynamic allocation or some similar spark config which
>>> could vary performance between runs because the same resources were not
>>> utilized on Jupyter / spark-submit?
>>>
>>>
>>> On Wed, Sep 11, 2019, 08:43 Stephen Boesch <javadba@gmail.com> wrote:
>>>
>>>> Sounds like you have done your homework to properly compare. I'm
>>>> guessing the answer to the following is yes, but in any case: are they
>>>> both running against the same spark cluster with the same configuration
>>>> parameters, especially executor memory and number of workers?
>>>>
>>>> On Tue, Sep 10, 2019 at 20:05, Dhrubajyoti Hati <
>>>> dhruba.work@gmail.com> wrote:
>>>>
>>>>> No, I checked for that, hence I wrote "brand new" jupyter notebook.
>>>>> Also, the times taken by the two are 30 mins and ~3 hrs, as I am reading
>>>>> 500 GB of compressed, base64-encoded text data from a hive table and
>>>>> decompressing and decoding it in one of the UDFs. Also, the time compared
>>>>> is from the Spark UI, not how long the job actually takes after
>>>>> submission. It's just the running time I am comparing/mentioning.
>>>>>
>>>>> As mentioned earlier, all the spark conf params match in the two
>>>>> scripts, and that's why I am puzzled about what is going on.
>>>>>
>>>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>>>> pmccarthy@dstillery.com> wrote:
>>>>>
>>>>>> It's not obvious from what you pasted, but perhaps the jupyter
>>>>>> notebook already is connected to a running spark context, while
>>>>>> spark-submit needs to get a new spot in the (YARN?) queue.
>>>>>>
>>>>>> I would check the cluster job IDs for both to ensure you're getting
>>>>>> new cluster tasks for each.
>>>>>>
>>>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <
>>>>>> dhruba.work@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>>>> what the code looks like mostly:
>>>>>>>
>>>>>>> def fn1(ip):
>>>>>>>    some code...
>>>>>>>     ...
>>>>>>>
>>>>>>> def fn2(row):
>>>>>>>     ...
>>>>>>>     some operations
>>>>>>>     ...
>>>>>>>     return row1
>>>>>>>
>>>>>>>
>>>>>>> udf_fn1 = udf(fn1)
>>>>>>> # hive table is of size > 500 GB with ~4500 partitions
>>>>>>> cdf = spark.read.table("xxxx")
>>>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>>>     .drop("colz") \
>>>>>>>     .withColumnRenamed("colz", "coly")
>>>>>>>
>>>>>>> edf = ddf \
>>>>>>>     .filter(ddf.colp == 'some_value') \
>>>>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>>>>     .toDF()
>>>>>>>
>>>>>>> # simple way to test performance on both platforms
>>>>>>> print(edf.count())
>>>>>>>
>>>>>>> Now when I run the same code in a brand new jupyter notebook, it runs
>>>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>>>> configurations are printed and compared from both platforms, and they
>>>>>>> are exactly the same. I even tried to run this script in a single cell
>>>>>>> of a jupyter notebook and still got the same performance. I need to
>>>>>>> understand if I am missing something in the spark-submit which is
>>>>>>> causing the issue. I tried to minimise the script to reproduce the
>>>>>>> same issue without much code.
>>>>>>>
>>>>>>> Both are run in client mode on a YARN-based spark cluster. The
>>>>>>> machines from which both are executed are also the same, and both
>>>>>>> are run by the same user.
>>>>>>>
>>>>>>> What I found is that the median quantile value for the run with
>>>>>>> jupyter was 1.3 mins, and for the run with spark-submit it was
>>>>>>> ~8.5 mins. I am not able to figure out why this is happening.
>>>>>>>
>>>>>>> Has anyone faced this kind of issue before, or does anyone know how
>>>>>>> to resolve this?
>>>>>>>
>>>>>>> *Regards,*
>>>>>>> *Dhrub*
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>> *Patrick McCarthy  *
>>>>>>
>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>
>>>>>> Dstillery
>>>>>>
>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>>>
>>>>>
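
Because the decompress UDF quoted above uses only the standard library, it can be timed outside Spark under each interpreter (the one in the Jupyter virtualenv and the one spark-submit picks up) to see how much of the gap the interpreter alone might explain. A rough sketch, assuming gzip-compressed, base64-encoded UTF-8 text as the thread describes; the sample payload and the helpers make_sample and time_decompress are invented for illustration, and the code sticks to constructs that should also run on the Python 2 interpreters mentioned in the thread:

```python
import base64
import gzip
import io
import sys
import timeit
import zlib


def decompress(data):
    """Same logic as the UDF quoted in the thread."""
    raw = base64.b64decode(data)
    # 32 + zlib.MAX_WBITS lets zlib auto-detect a gzip or zlib header.
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    return d.decompress(raw).decode("utf-8")


def make_sample(text):
    """Gzip and base64-encode text (hypothetical stand-in for one table cell)."""
    buf = io.BytesIO()
    f = gzip.GzipFile(fileobj=buf, mode="wb")
    f.write(text.encode("utf-8"))
    f.close()  # flushes the gzip trailer into buf
    return base64.b64encode(buf.getvalue())


def time_decompress(sample, number=1000):
    """Average seconds per decompress() call over `number` calls."""
    return timeit.timeit(lambda: decompress(sample), number=number) / number


if __name__ == "__main__":
    sample = make_sample(u"some repetitive text " * 5000)
    print(sys.version)
    print("%.1f microseconds per call" % (time_decompress(sample) * 1e6))
```

Running the same file with each interpreter gives a per-call figure that can be compared directly. It says nothing about serialization between the JVM and the Python workers, so it is only one piece of the picture.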

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016
