It's not obvious from what you pasted, but perhaps the Jupyter notebook is already connected to a running Spark context, while spark-submit has to wait for a new spot in the (YARN?) queue.

I would check the YARN application IDs for both to ensure each run is actually getting a new application on the cluster.
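
For example, you could print the application ID of the active context in both environments and compare (a minimal sketch; "spark" is assumed to be the live SparkSession in each environment):

    # Print the YARN application ID of the context this environment is using.
    # If the notebook reports the same ID across runs, it is reusing an
    # existing application rather than starting a new one.
    print(spark.sparkContext.applicationId)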

On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <> wrote:

I am facing a weird behaviour while running a python script. Here is what the code looks like mostly:

from pyspark.sql.functions import udf

def fn1(ip):
    ...  # some code

def fn2(row):
    ...  # some operations
    return row1

udf_fn1 = udf(fn1)
cdf ="xxxx")  # hive table is > 500 GB with ~4500 partitions
ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
    .drop("colz") \
    .withColumnRenamed("coly", "colz")

edf = ddf \
    .filter(ddf.colp == 'some_value') \
    .rdd.map(fn2)

print(edf.count())  # simple way to test performance on both platforms

Now when I run the same code in a brand new Jupyter notebook, it runs ~6x faster than when I run this Python script using spark-submit. The configurations printed from both platforms were compared and are exactly the same. I even tried running this script in a single cell of a Jupyter notebook and still see the same performance gap. I need to understand what I am missing in the spark-submit run that is causing the issue. I tried to minimise the script above to reproduce the problem without much code.
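
(For reference, the configurations were compared by dumping them in both environments with something like the following sketch, assuming "spark" is the active SparkSession, and diffing the output:

    # Dump every effective Spark config key/value, sorted, so the two
    # environments can be compared line by line.
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        print(key, "=", value)
)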

Both are run in client mode on a YARN-based Spark cluster. The machine from which both are executed is also the same, and both run as the same user.

What I found from the task-time quantiles is that the median for the run with Jupyter was ~1.3 minutes, while the median for the run with spark-submit was ~8.5 minutes. I am not able to figure out why this is happening.

Has anyone faced this kind of issue before, or does anyone know how to resolve it?



Patrick McCarthy 

Senior Data Scientist, Machine Learning Engineering


470 Park Ave South, 17th Floor, NYC 10016