spark-user mailing list archives

From Dhrubajyoti Hati <dhruba.w...@gmail.com>
Subject Re: script running in jupyter 6-7x faster than spark submit
Date Wed, 11 Sep 2019 17:02:08 GMT
Also, the performance remains identical when the same script is run from the
jupyter terminal instead of a normal terminal. In the script the spark
session is created by

spark = SparkSession \
    .builder \
    ...
    .getOrCreate()
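
For completeness, the full chain is roughly of this shape (the app name and
Hive support below are only placeholders, not taken from the actual script):

from pyspark.sql import SparkSession

# app name is illustrative only
spark = SparkSession \
    .builder \
    .appName("example-app") \
    .enableHiveSupport() \
    .getOrCreate()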


On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati <dhruba.work@gmail.com>
wrote:

> If you say that libraries are not transferred by default, and in my case I
> haven't used any --py-files, then am I facing a 6x speed difference just
> because the driver python is different? I am using client mode to submit
> the program, but the udfs and everything else are executed on the executors,
> so why is the difference so large?
>
> I tried the prints.
> For the jupyter one, the driver prints
> ../../jupyter-folder/venv
>
> and the executors print /usr
>
> For spark-submit, both of them print /usr
>
> The cluster was created a few years back and is used organisation wide, so I
> honestly do not know how Python 2.6.6 was installed. I copied the whole
> jupyter setup from the org git repo as it was shared, so I do not even know
> how the venv was created or how the Python for the venv was installed.
>
> The OS is CentOS release 6.9 (Final)
>
>
>
>
>
> *Regards, Dhrubajyoti Hati. Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 8:22 PM Abdeali Kothari <abdealikothari@gmail.com>
> wrote:
>
>> The driver python may not always be the same as the executor python.
>> You can set these using PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
>>
>> The dependent libraries are not transferred by spark in any way unless
>> you do a --py-files or .addPyFile()
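>>
>> For example (file and module names below are just placeholders):
>>
>> spark-submit --py-files deps.zip,extra_module.py my_script.py
>>
>> # or, on an already running session:
>> spark.sparkContext.addPyFile("/path/to/deps.zip")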
>>
>> Could you try this:
>> *import sys; print(sys.prefix)*
>>
>> on the driver, and also run this inside a UDF with:
>>
>> *def dummy(a):*
>> *    import sys; raise AssertionError(sys.prefix)*
>>
>> and get the traceback exception on the driver ?
>> This would be the best way to get the exact sys.prefix (python path) for
>> both the executors and driver.
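>>
>> Put together, a minimal sketch of that check (df and colz below are just
>> placeholders for one of your dataframes and its column):
>>
>> import sys
>> from pyspark.sql.functions import udf
>>
>> print(sys.prefix)  # driver python
>>
>> def dummy(a):
>>     import sys; raise AssertionError(sys.prefix)
>>
>> dummy_udf = udf(dummy)
>> # fails on purpose; the AssertionError message in the driver-side traceback
>> # carries the executor's sys.prefix
>> df.withColumn("check", dummy_udf(df.colz)).show(1)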
>>
>> Also, could you elaborate on what environment this is?
>> Linux? - CentOS/Ubuntu/etc.?
>> How was the py 2.6.6 installed?
>> How was the py 2.7.5 venv created, and how was the base py 2.7.5
>> installed?
>>
>> Also, how are you creating the Spark Session in jupyter ?
>>
>>
>> On Wed, Sep 11, 2019 at 7:33 PM Dhrubajyoti Hati <dhruba.work@gmail.com>
>> wrote:
>>
>>> But would that be the case with multiple tasks running on the same worker?
>>> Also, both jobs run in client mode, so whatever is true for one should be true
>>> for both, or for neither. As mentioned earlier, all the confs are the same; I
>>> have checked and compared each conf.
>>>
>>> As Abdeali mentioned, it must be because of the way the libraries are set up
>>> in the two environments. I also verified this by running, in the jupyter
>>> environment, the same script that I was running with spark-submit, and got
>>> the same result.
>>>
>>> Currently I am looking into the ways python packages are transferred from the
>>> driver to the spark cluster in client mode. Any info on that topic would be
>>> helpful.
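>>>
>>> One mechanism I came across (not tried yet; the file names and paths below
>>> are only placeholders) is packing the whole virtualenv and shipping it to the
>>> executors with --archives on YARN:
>>>
>>> # pack the venv into venv.tar.gz first (e.g. with venv-pack or conda-pack)
>>> export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python
>>> export PYSPARK_PYTHON=./environment/bin/python
>>> spark-submit --master yarn --deploy-mode client \
>>>     --archives venv.tar.gz#environment my_script.py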
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Wed, 11 Sep, 2019, 7:06 PM Patrick McCarthy, <pmccarthy@dstillery.com>
>>> wrote:
>>>
>>>> Are you running in cluster mode? A large virtualenv zip for the driver
>>>> sent into the cluster on a slow pipe could account for much of that eight
>>>> minutes.
>>>>
>>>> On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati <dhruba.work@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just ran the same script in a shell inside the jupyter notebook environment
>>>>> and found the performance to be similar. So I can confirm this is happening
>>>>> because the python (and its libraries) used by the jupyter notebook is
>>>>> different from the spark-submit python.
>>>>>
>>>>> But now I have a follow-up question. Are the dependent libraries of a
>>>>> python script also transferred to the worker machines when a python script
>>>>> is executed in spark? Because even though the driver python versions are
>>>>> different, the worker machines will use their own (identical) python
>>>>> environment to run the code. If anyone can explain this part, it would be
>>>>> helpful.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *Regards, Dhrubajyoti Hati. Mob No: 9886428028/9652029028*
>>>>>
>>>>>
>>>>> On Wed, Sep 11, 2019 at 9:45 AM Dhrubajyoti Hati <
>>>>> dhruba.work@gmail.com> wrote:
>>>>>
>>>>>> Just checked from where the script is submitted, i.e. w.r.t. the driver: the
>>>>>> python envs are different. The jupyter one is running within a virtual
>>>>>> environment which is Python 2.7.5, and the spark-submit one uses 2.6.6.
>>>>>> But the executors have the same python version, right? I tried doing a
>>>>>> spark-submit from the jupyter shell; it fails because python 2.7 is not
>>>>>> there and hence throws an error.
>>>>>>
>>>>>> Here is the udf which might take time:
>>>>>>
>>>>>> import base64
>>>>>> import zlib
>>>>>>
>>>>>> def decompress(data):
>>>>>>     # base64-decode, then decompress; 32 + MAX_WBITS lets zlib
>>>>>>     # auto-detect the gzip/zlib header
>>>>>>     bytecode = base64.b64decode(data)
>>>>>>     d = zlib.decompressobj(32 + zlib.MAX_WBITS)
>>>>>>     decompressed_data = d.decompress(bytecode)
>>>>>>     return decompressed_data.decode('utf-8')
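>>>>>>
>>>>>> As a quick local sanity check of the function itself (the sample string
>>>>>> below is made up, not real data), it round-trips like this:
>>>>>>
>>>>>> sample = base64.b64encode(zlib.compress(b'hello world'))  # made-up input
>>>>>> print(decompress(sample))  # -> hello world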
>>>>>>
>>>>>>
>>>>>> Could this be because of the mismatch between the two python environments
>>>>>> on the driver side? But then, the processing happens on the executor side,
>>>>>> doesn't it?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Regards, Dhrub*
>>>>>>
>>>>>> On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari <
>>>>>> abdealikothari@gmail.com> wrote:
>>>>>>
>>>>>>> Maybe you can try running it in a python shell or
>>>>>>> jupyter-console/ipython instead of a spark-submit and check how much time
>>>>>>> it takes too.
>>>>>>>
>>>>>>> Compare the env variables to check that no additional env
>>>>>>> configuration is present in either environment.
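>>>>>>>
>>>>>>> A rough way to do that comparison (assuming `spark` is the live session in
>>>>>>> each environment) is to dump the env vars and the spark conf in both runs
>>>>>>> and diff the outputs:
>>>>>>>
>>>>>>> import os, json
>>>>>>> print(json.dumps(dict(os.environ), indent=2, sort_keys=True))
>>>>>>> print(sorted(spark.sparkContext.getConf().getAll()))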
>>>>>>>
>>>>>>> Also, is the python environment for both the exact same? I ask
>>>>>>> because it looks like you're using a UDF, and if the jupyter python has
>>>>>>> (let's say) numpy compiled with BLAS it would be faster than a numpy
>>>>>>> without it, etc. I.e. some library you use may be using pure python and
>>>>>>> another may be using a faster C extension...
>>>>>>>
>>>>>>> What python libraries are you using in the UDFs? If you don't use
>>>>>>> UDFs at all and use some very simple pure spark functions, does the time
>>>>>>> difference still exist?
>>>>>>>
>>>>>>> Also, are you using dynamic allocation or some similar spark config
>>>>>>> which could vary performance between runs because the same resources were
>>>>>>> not utilized on jupyter / spark-submit?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 11, 2019, 08:43 Stephen Boesch <javadba@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sounds like you have done your homework to compare properly. I'm
>>>>>>>> guessing the answer to the following is yes, but in any case: are they
>>>>>>>> both running against the same spark cluster with the same configuration
>>>>>>>> parameters, especially executor memory and number of workers?
>>>>>>>>
>>>>>>>> On Tue, Sep 10, 2019 at 8:05 PM Dhrubajyoti Hati <
>>>>>>>> dhruba.work@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> No, I checked for that, hence I wrote "brand new" jupyter
>>>>>>>>> notebook. Also, the times taken by the two are 30 mins and ~3 hrs, as I am
>>>>>>>>> reading 500 gigs of compressed, base64-encoded text data from a hive table
>>>>>>>>> and decompressing and decoding it in one of the udfs. Also, the time
>>>>>>>>> compared is from the Spark UI, not how long the job actually takes after
>>>>>>>>> submission. It's just the running time I am comparing/mentioning.
>>>>>>>>>
>>>>>>>>> As mentioned earlier, all the spark conf params even match in the two
>>>>>>>>> scripts, and that's why I am puzzled about what is going on.
>>>>>>>>>
>>>>>>>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>>>>>>>> pmccarthy@dstillery.com> wrote:
>>>>>>>>>
>>>>>>>>>> It's not obvious from what you pasted, but perhaps the jupyter
>>>>>>>>>> notebook is already connected to a running spark context, while
>>>>>>>>>> spark-submit needs to get a new spot in the (YARN?) queue.
>>>>>>>>>>
>>>>>>>>>> I would check the cluster job IDs for both to ensure you're
>>>>>>>>>> getting new cluster tasks for each.
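>>>>>>>>>>
>>>>>>>>>> For example, on a YARN cluster, something like this while each job is
>>>>>>>>>> running should show a separate application ID per run:
>>>>>>>>>>
>>>>>>>>>> yarn application -list -appStates RUNNING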
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <
>>>>>>>>>> dhruba.work@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am facing a weird behaviour while running a python script.
>>>>>>>>>>> Here is what the code looks like mostly:
>>>>>>>>>>>
>>>>>>>>>>> def fn1(ip):
>>>>>>>>>>>     some code...
>>>>>>>>>>>     ...
>>>>>>>>>>>
>>>>>>>>>>> def fn2(row):
>>>>>>>>>>>     ...
>>>>>>>>>>>     some operations
>>>>>>>>>>>     ...
>>>>>>>>>>>     return row1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> udf_fn1 = udf(fn1)
>>>>>>>>>>> # hive table is of size > 500 Gigs with ~4500 partitions
>>>>>>>>>>> cdf = spark.read.table("xxxx")
>>>>>>>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>>>>>>>     .drop("colz") \
>>>>>>>>>>>     .withColumnRenamed("colz", "coly")
>>>>>>>>>>>
>>>>>>>>>>> edf = ddf \
>>>>>>>>>>>     .filter(ddf.colp == 'some_value') \
>>>>>>>>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>>>>>>>>     .toDF()
>>>>>>>>>>>
>>>>>>>>>>> # simple way for the performance test in both platforms
>>>>>>>>>>> print edf.count()
>>>>>>>>>>>
>>>>>>>>>>> Now when I run the same code in a brand new jupyter notebook, it
>>>>>>>>>>> runs 6x faster than when I run this python script using spark-submit. The
>>>>>>>>>>> configurations were printed and compared from both platforms and they are
>>>>>>>>>>> exactly the same. I even tried to run this script in a single cell of a
>>>>>>>>>>> jupyter notebook and still got the same performance. I need to understand
>>>>>>>>>>> whether I am missing something in the spark-submit which is causing the
>>>>>>>>>>> issue. I have tried to minimise the script to reproduce the same behaviour
>>>>>>>>>>> without much code.
>>>>>>>>>>>
>>>>>>>>>>> Both are run in client mode on a YARN-based spark cluster. The
>>>>>>>>>>> machine from which both are executed is also the same, and both run as
>>>>>>>>>>> the same user.
>>>>>>>>>>>
>>>>>>>>>>> What I found is that the median (quantile) value for the run with
>>>>>>>>>>> jupyter was 1.3 mins, while for the run with spark-submit it was ~8.5
>>>>>>>>>>> mins. I am not able to figure out why this is happening.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone faced this kind of issue before, or does anyone know how
>>>>>>>>>>> to resolve this?
>>>>>>>>>>>
>>>>>>>>>>> *Regards,*
>>>>>>>>>>> *Dhrub*
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Patrick McCarthy  *
>>>>>>>>>>
>>>>>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>>>>>
>>>>>>>>>> Dstillery
>>>>>>>>>>
>>>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>>>>>>>
>>>>>>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Patrick McCarthy  *
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>
