spark-user mailing list archives

From Stephen Boesch <java...@gmail.com>
Subject Re: script running in jupyter 6-7x faster than spark submit
Date Wed, 11 Sep 2019 03:32:17 GMT
Ok. Can't think of why that would happen.

On Tue, 10 Sep 2019 at 20:26, Dhrubajyoti Hati <dhruba.work@gmail.com> wrote:

> As mentioned in the very first mail:
> * both are submitted to the same cluster
> * both are submitted from the same machine and by the same user
> * each requests 128 executors with 2 cores and 8 GB of memory per
> executor, and both actually get those resources while running
>
> To clarify further, let me quote what I mentioned above. *This data is taken
> from the Spark UI when the jobs are almost finished in both cases.*
> "What I found is that the median task time (from the quantile summary) for the
> one run with jupyter was 1.3 mins and for the one run with spark-submit was
> ~8.5 mins." This means the time taken per task is much higher in the
> spark-submit script than in the jupyter script. This is where I am really
> puzzled, because they are exactly the same code. Why does running it in two
> different ways change the execution time so much?
>
>
>
>
> *Regards, Dhrubajyoti Hati. Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch <javadba@gmail.com> wrote:
>
>> Sounds like you have done your homework to compare them properly. I'm
>> guessing the answer to the following is yes, but in any case: are they
>> both running against the same Spark cluster with the same configuration
>> parameters, especially executor memory and number of workers?
>>
>> On Tue, 10 Sep 2019 at 20:05, Dhrubajyoti Hati <dhruba.work@gmail.com> wrote:
>>
>>> No, I checked for that, which is why I wrote "brand new" jupyter notebook.
>>> Also, the times taken are 30 mins and ~3 hrs, as I am reading 500 GB of
>>> compressed, base64-encoded text data from a Hive table and decompressing and
>>> decoding it in one of the UDFs. Also, the time compared is from the Spark UI,
>>> not how long the job actually takes after submission. It's just the running
>>> time that I am comparing/mentioning.
>>>
>>> As mentioned earlier, all the Spark conf params match between the two
>>> scripts, and that's why I am puzzled about what is going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccarthy@dstillery.com> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>>> is already connected to a running Spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.work@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing some weird behaviour while running a python script. Here is
>>>>> roughly what the code looks like:
>>>>>
>>>>> from pyspark.sql.functions import udf
>>>>>
>>>>> def fn1(ip):
>>>>>     # some code...
>>>>>     ...
>>>>>
>>>>> def fn2(row):
>>>>>     ...
>>>>>     # some operations
>>>>>     ...
>>>>>     return row1
>>>>>
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> # hive table is of size > 500 Gigs with ~4500 partitions
>>>>> cdf = spark.read.table("xxxx")
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>     .drop("colz") \
>>>>>     .withColumnRenamed("colz", "coly")
>>>>>
>>>>> edf = ddf \
>>>>>     .filter(ddf.colp == 'some_value') \
>>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>>     .toDF()
>>>>>
>>>>> # simple way for the performance test in both platforms
>>>>> print(edf.count())
>>>>>
>>>>> Now when I run the same code in a brand new jupyter notebook, it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations were printed and compared on both platforms, and they
>>>>> are exactly the same. I even tried running this script in a single cell of
>>>>> a jupyter notebook and still got the same performance. I need to understand
>>>>> if I am missing something in spark-submit that is causing the issue. I
>>>>> tried to minimise the script to reproduce the same problem without much code.
>>>>>
>>>>> Both are run in client mode on a YARN-based Spark cluster. The
>>>>> machine from which both are executed is also the same, and so is the user.
>>>>>
>>>>> What I found is that the median task time (from the quantile summary) for
>>>>> the one run with jupyter was 1.3 mins and for the one run with spark-submit
>>>>> was ~8.5 mins. I am not able to figure out why this is happening.
>>>>>
>>>>> Has anyone faced this kind of issue before, or does anyone know how to
>>>>> resolve it?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Patrick McCarthy  *
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>
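
Since the thread relies on the configurations having been "printed and compared" by eye, one way to make that comparison mechanical is sketched below. This is only an illustration, not something posted in the thread: the helper names `runtime_snapshot` and `diff_snapshots` are hypothetical, and it assumes that the Python interpreter and the `PYSPARK_PYTHON` / `PYSPARK_DRIVER_PYTHON` environment variables are worth recording alongside the Spark settings. The Spark settings themselves would be passed in as (key, value) pairs, for example from `spark.sparkContext.getConf().getAll()`.

```python
import os
import sys


def runtime_snapshot(spark_conf_pairs=()):
    """Collect interpreter facts plus any (key, value) Spark settings passed in.

    Hypothetical helper: in a real run, spark_conf_pairs could be the list
    returned by spark.sparkContext.getConf().getAll().
    """
    snapshot = {
        "python.executable": sys.executable,
        "python.version": sys.version.split()[0],
        # These variables choose the worker and driver Python interpreters.
        "env.PYSPARK_PYTHON": os.environ.get("PYSPARK_PYTHON", ""),
        "env.PYSPARK_DRIVER_PYTHON": os.environ.get("PYSPARK_DRIVER_PYTHON", ""),
    }
    # Prefix Spark settings so they cannot collide with the keys above.
    snapshot.update({"conf." + key: value for key, value in spark_conf_pairs})
    return snapshot


def diff_snapshots(left, right):
    """Return {key: (left_value, right_value)} for every key whose values differ."""
    keys = set(left) | set(right)
    return {
        key: (left.get(key), right.get(key))
        for key in sorted(keys)
        if left.get(key) != right.get(key)
    }
```

Each launch path (notebook and spark-submit) could write its snapshot to a file, and `diff_snapshots` would then list only the entries that differ. This captures driver-side values only; executor-side values would need the same function to be called from inside a task.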
