spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick McCarthy <pmccar...@dstillery.com.INVALID>
Subject Re: script running in jupyter 6-7x faster than spark submit
Date Tue, 10 Sep 2019 19:14:27 GMT
It's not obvious from what you pasted, but perhaps the juypter notebook
already is connected to a running spark context, while spark-submit needs
to get a new spot in the (YARN?) queue.

I would check the cluster job IDs for both to ensure you're getting new
cluster tasks for each.

On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.work@gmail.com>
wrote:

> Hi,
>
> I am facing a weird behaviour while running a python script. Here is what
> the code looks like mostly:
>
> def fn1(ip):
>    some code...
>     ...
>
> def fn2(row):
>     ...
>     some operations
>     ...
>     return row1
>
>
> udf_fn1 = udf(fn1)
> cdf = spark.read.table("xxxx") //hive table is of size > 500 Gigs with
> ~4500 partitions
> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>     .drop("colz") \
>     .withColumnRenamed("colz", "coly")
>
> edf = ddf \
>     .filter(ddf.colp == 'some_value') \
>     .rdd.map(lambda row: fn2(row)) \
>     .toDF()
>
> print edf.count() // simple way for the performance test in both platforms
>
> Now when I run the same code in a brand new jupyter notebook it runs 6x
> faster than when I run this python script using spark-submit. The
> configurations are printed and  compared from both the platforms and they
> are exact same. I even tried to run this script in a single cell of jupyter
> notebook and still have the same performance. I need to understand if I am
> missing something in the spark-submit which is causing the issue.  I tried
> to minimise the script to reproduce the same error without much code.
>
> Both are run in client mode on a yarn based spark cluster. The machines
> from which both are executed are also the same and from same user.
>
> What i found is the  the quantile values for median for one ran with
> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins.  I am not
> able to figure out why this is happening.
>
> Any one faced this kind of issue before or know how to resolve this?
>
> *Regards,*
> *Dhrub*
>


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016

Mime
View raw message