spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dhrubajyoti Hati <>
Subject Re: script running in jupyter 6-7x faster than spark submit
Date Wed, 11 Sep 2019 04:15:54 GMT
Just checked from where the script is submitted i.e. wrt Driver, the python
env are different. Jupyter one is running within a the virtual environment
which is Python 2.7.5 and the spark-submit one uses 2.6.6. But the
executors have the same python version right? I tried doing a spark-submit
from jupyter shell, it fails to find python 2.7  which is not there hence
throws error.

Here is the udf which might take time:

import base64
import zlib

def decompress(data):

    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode )

Could this because of the two python environment mismatch from Driver
side? But the processing

happens in the executor side?


On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari <>

> Maybe you can try running it in a python shell or jupyter-console/ipython
> instead of a spark-submit and check how much time it takes too.
> Compare the env variables to check that no additional env configuration is
> present in either environment.
> Also is the python environment for both the exact same? I ask because it
> looks like you're using a UDF and if the Jupyter python has (let's say)
> numpy compiled with blas it would be faster than a numpy without it. Etc.
> I.E. Some library you use may be using pure python and another may be using
> a faster C extension...
> What python libraries are you using in the UDFs? It you don't use UDFs at
> all and use some very simple pure spark functions does the time difference
> still exist?
> Also are you using dynamic allocation or some similar spark config which
> could vary performance between runs because the same resources we're not
> utilized on Jupyter / spark-submit?
> On Wed, Sep 11, 2019, 08:43 Stephen Boesch <> wrote:
>> Sounds like you have done your homework to properly compare .   I'm
>> guessing the answer to the following is yes .. but in any case:  are they
>> both running against the same spark cluster with the same configuration
>> parameters especially executor memory and number of workers?
>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
>>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>>> compressed base64 encoded text data from a hive table and decompressing and
>>> decoding in one of the udfs. Also the time compared is from Spark UI not
>>> how long the job actually takes after submission. Its just the running time
>>> i am comparing/mentioning.
>>> As mentioned earlier, all the spark conf params even match in two
>>> scripts and that's why i am puzzled what going on.
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>>> wrote:
>>>> It's not obvious from what you pasted, but perhaps the juypter notebook
>>>> already is connected to a running spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <>
>>>> wrote:
>>>>> Hi,
>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>> what the code looks like mostly:
>>>>> def fn1(ip):
>>>>>    some code...
>>>>>     ...
>>>>> def fn2(row):
>>>>>     ...
>>>>>     some operations
>>>>>     ...
>>>>>     return row1
>>>>> udf_fn1 = udf(fn1)
>>>>> cdf ="xxxx") //hive table is of size > 500 Gigs
>>>>> ~4500 partitions
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>     .drop("colz") \
>>>>>     .withColumnRenamed("colz", "coly")
>>>>> edf = ddf \
>>>>>     .filter(ddf.colp == 'some_value') \
>>>>> row: fn2(row)) \
>>>>>     .toDF()
>>>>> print edf.count() // simple way for the performance test in both
>>>>> platforms
>>>>> Now when I run the same code in a brand new jupyter notebook it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations are printed and  compared from both the platforms and
>>>>> are exact same. I even tried to run this script in a single cell of jupyter
>>>>> notebook and still have the same performance. I need to understand if
I am
>>>>> missing something in the spark-submit which is causing the issue.  I
>>>>> to minimise the script to reproduce the same error without much code.
>>>>> Both are run in client mode on a yarn based spark cluster. The
>>>>> machines from which both are executed are also the same and from same
>>>>> What i found is the  the quantile values for median for one ran with
>>>>> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins.  I
am not
>>>>> able to figure out why this is happening.
>>>>> Any one faced this kind of issue before or know how to resolve this?
>>>>> *Regards,*
>>>>> *Dhrub*
>>>> --
>>>> *Patrick McCarthy  *
>>>> Senior Data Scientist, Machine Learning Engineering
>>>> Dstillery
>>>> 470 Park Ave South, 17th Floor, NYC 10016

View raw message