spark-user mailing list archives

From Dhrubajyoti Hati <dhruba.w...@gmail.com>
Subject Re: script running in jupyter 6-7x faster than spark submit
Date Wed, 11 Sep 2019 03:25:50 GMT
As mentioned in the very first mail:
* both are submitted to the same cluster
* both are submitted from the same machine and by the same user
* each of them requests 128 executors with 2 cores and 8 GB of memory per
executor, and both of them actually get that while running
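
To check the "same configuration" claim mechanically rather than by eye,
a minimal sketch follows. It assumes the (key, value) pairs returned by
spark.sparkContext.getConf().getAll() were saved in each environment;
diff_conf and the sample values below are hypothetical and only illustrate
the comparison.

```python
def diff_conf(conf_a, conf_b, ignore=()):
    """Return {key: (value_a, value_b)} for every key whose value differs.

    conf_a and conf_b are iterables of (key, value) pairs, the shape
    returned by spark.sparkContext.getConf().getAll().
    Keys listed in `ignore` (e.g. per-application IDs) are skipped.
    """
    a, b = dict(conf_a), dict(conf_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if key not in ignore and a.get(key) != b.get(key)
    }


# Hypothetical dumps; real ones would come from getConf().getAll().
jupyter_conf = [
    ("spark.executor.instances", "128"),
    ("spark.executor.cores", "2"),
    ("spark.executor.memory", "8g"),
    ("spark.app.id", "application_A"),
]
submit_conf = [
    ("spark.executor.instances", "128"),
    ("spark.executor.cores", "2"),
    ("spark.executor.memory", "8g"),
    ("spark.app.id", "application_B"),
    ("spark.pyspark.python", "/usr/bin/python"),  # made-up value
]

differences = diff_conf(jupyter_conf, submit_conf, ignore=("spark.app.id",))
for key, (left, right) in differences.items():
    print(key, "jupyter:", left, "spark-submit:", right)
```

A key that is set on only one side shows up with None for the other, so
settings that one launcher adds implicitly are reported too.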

To clarify further, let me quote what I mentioned above. *These figures are
taken from the Spark UI when the jobs are almost finished in both cases.*
"What I found is that the median value in the task-time quantiles for the
run in jupyter was 1.3 mins and for the run with spark-submit was ~8.5
mins." This means the time taken per task is much higher in the
spark-submit script than in the jupyter script. This is where I am really
puzzled, because they are the exact same code. Why does running it in two
different ways make the execution time vary so much?
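
Since the gap is per task and the heavy work happens inside a Python udf,
one thing that could be compared is the Python interpreter each launch
method uses on the driver and on the executors. This is only a sketch of
such a check, not a diagnosis: python_env_snapshot is a hypothetical
helper, and the commented executor-side lines assume an active
SparkSession named spark.

```python
import os
import platform
import sys


def python_env_snapshot():
    """Collect interpreter details that can differ between launch methods."""
    return {
        "executable": sys.executable,
        "version": platform.python_version(),
        "implementation": platform.python_implementation(),
        # Environment variables PySpark reads to choose the interpreter;
        # None means the variable is not set in this process.
        "PYSPARK_PYTHON": os.environ.get("PYSPARK_PYTHON"),
        "PYSPARK_DRIVER_PYTHON": os.environ.get("PYSPARK_DRIVER_PYTHON"),
    }


# Driver side: run this in the notebook and in the spark-submit script.
driver_snapshot = python_env_snapshot()
print(driver_snapshot)

# Executor side (left commented so the sketch runs without a cluster):
# executor_snapshots = (
#     spark.sparkContext.parallelize(range(256), 256)
#     .map(lambda _: str(sorted(python_env_snapshot().items(), key=lambda kv: kv[0])))
#     .distinct()
#     .collect()
# )
# print(executor_snapshots)
```

If the two launch methods print different snapshots, that difference is
worth ruling out before looking further; identical snapshots would point
the search elsewhere.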




*Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*


On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch <javadba@gmail.com> wrote:

> Sounds like you have done your homework to compare them properly. I'm
> guessing the answer to the following is yes, but in any case: are they
> both running against the same spark cluster with the same configuration
> parameters, especially executor memory and number of workers?
>
> On Tue, Sep 10, 2019 at 8:05 PM Dhrubajyoti Hati <dhruba.work@gmail.com>
> wrote:
>
>> No, I checked for that, which is why I wrote "brand new" jupyter
>> notebook. Also, the times taken are 30 mins (jupyter) and ~3 hrs
>> (spark-submit), as I am reading 500 GB of compressed, base64-encoded text
>> data from a hive table and decompressing and decoding it in one of the
>> udfs. Also, the time compared is the running time shown in the Spark UI,
>> not how long the job actually takes after submission.
>>
>> As mentioned earlier, all the spark conf params match in the two scripts,
>> and that's why I am puzzled about what is going on.
>>
>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <pmccarthy@dstillery.com>
>> wrote:
>>
>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>> is already connected to a running spark context, while spark-submit needs
>>> to get a new spot in the (YARN?) queue.
>>>
>>> I would check the cluster job IDs for both to ensure you're getting new
>>> cluster tasks for each.
>>>
>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.work@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am facing weird behaviour while running a python script. Here is
>>>> roughly what the code looks like:
>>>>
>>>> def fn1(ip):
>>>>     # some code...
>>>>     ...
>>>>
>>>> def fn2(row):
>>>>     ...
>>>>     # some operations
>>>>     ...
>>>>     return row1
>>>>
>>>>
>>>> from pyspark.sql.functions import udf
>>>>
>>>> udf_fn1 = udf(fn1)
>>>> # hive table is of size > 500 Gigs with ~4500 partitions
>>>> cdf = spark.read.table("xxxx")
>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>     .drop("colz") \
>>>>     .withColumnRenamed("colz", "coly")  # no-op: "colz" is already dropped
>>>>
>>>> edf = ddf \
>>>>     .filter(ddf.colp == 'some_value') \
>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>     .toDF()
>>>>
>>>> # simple way to test the performance on both platforms
>>>> print(edf.count())
>>>>
>>>> Now when I run the same code in a brand new jupyter notebook, it runs 6x
>>>> faster than when I run this python script using spark-submit. The
>>>> configurations were printed and compared on both platforms, and they
>>>> are exactly the same. I even tried running this script in a single cell
>>>> of a jupyter notebook and still got the same performance. I need to
>>>> understand whether I am missing something in spark-submit that is
>>>> causing the issue. I tried to minimise the script so that it reproduces
>>>> the problem without much code.
>>>>
>>>> Both are run in client mode on a YARN-based spark cluster. Both are
>>>> also executed from the same machine and by the same user.
>>>>
>>>> What I found is that the median value in the task-time quantiles for the
>>>> run in jupyter was 1.3 mins and for the run with spark-submit was ~8.5
>>>> mins. I am not able to figure out why this is happening.
>>>>
>>>> Has anyone faced this kind of issue before, or does anyone know how to
>>>> resolve it?
>>>>
>>>> *Regards,*
>>>> *Dhrub*
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> *Patrick McCarthy  *
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>
