From: Dhrubajyoti Hati <dhruba.work@gmail.com>
Date: Wed, 11 Sep 2019 22:32:08 +0530
Subject: Re: script running in jupyter 6-7x faster than spark submit
To: Abdeali Kothari
Cc: Patrick McCarthy, Stephen Boesch, User, dev

Also, the performance remains identical when running the same script from
a jupyter terminal instead of a normal terminal. In the script the spark
context is created by the

spark = SparkSession \
    .builder \
    ..
    ..
    .getOrCreate()

command.

On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati wrote:

> If you say that libraries are not transferred by default, and in my case
> I haven't used any --py-files, then am I facing a 6x speed difference just
> because the driver python is different? I am using client mode to submit
> the program, but the udfs and everything else are executed in the
> executors, so why is the difference so large?
>
> I tried the prints.
> For the jupyter one, the driver prints
> ../../jupyter-folder/venv
>
> and the executors print /usr
>
> For spark-submit, both of them print /usr
>
> The cluster was created a few years back and is used organisation-wide,
> so how python 2.6.6 was installed I honestly do not know. I copied the
> whole jupyter setup from the org git repo as it was shared, so I do not
> know how the venv was created, or even how the python for the venv was
> installed.
>
> The os is CentOS release 6.9 (Final)
>
> Regards,
> Dhrubajyoti Hati.
> Mob No: 9886428028/9652029028
>
> On Wed, Sep 11, 2019 at 8:22 PM Abdeali Kothari wrote:
>
>> The driver python may not always be the same as the executor python.
>> You can set these using PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON.
>>
>> The dependent libraries are not transferred by spark in any way unless
>> you do a --py-files or .addPyFile()
>>
>> Could you try this:
>>
>> import sys; print(sys.prefix)
>>
>> on the driver, and also run this inside a UDF with:
>>
>> def dummy(a):
>>     import sys; raise AssertionError(sys.prefix)
>>
>> and get the traceback exception on the driver?
>> This would be the best way to get the exact sys.prefix (python path)
>> for both the executors and the driver.
>>
>> Also, could you elaborate on what environment this is?
>> Linux? - CentOS/Ubuntu/etc.?
>> How was the py 2.6.6 installed?
>> How was the py 2.7.5 venv created, and how was the base py 2.7.5
>> installed?
>>
>> Also, how are you creating the Spark Session in jupyter?
>>
>> On Wed, Sep 11, 2019 at 7:33 PM Dhrubajyoti Hati wrote:
>>
>>> But would that be the case for multiple tasks running on the same
>>> worker? Also, both jobs are running in client mode, so whatever is true
>>> for one is true for both, or for neither. As mentioned earlier, all the
>>> confs are the same. I have checked and compared each conf.
>>>
>>> As Abdeali mentioned, it must be because of the way libraries are set
>>> up in the two environments. I also verified this by running, in the
>>> jupyter environment, the same script I was running with spark-submit,
>>> and got the same result.
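>>>
>>> For the record, the check Abdeali suggested can be run end to end like
>>> this (a minimal sketch, assuming a live SparkSession named spark; the
>>> one-row DataFrame is only illustrative):
>>>
>>> from pyspark.sql.functions import udf
>>>
>>> import sys
>>> print(sys.prefix)  # the driver python's prefix
>>>
>>> def dummy(a):
>>>     import sys
>>>     # Raising surfaces the executor python's sys.prefix in the
>>>     # driver-side traceback.
>>>     raise AssertionError(sys.prefix)
>>>
>>> # This fails by design: the AssertionError text in the traceback
>>> # carries the executor's sys.prefix.
>>> df = spark.range(1)
>>> df.select(udf(dummy)(df.id)).collect()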
>>>
>>> Currently I am searching for the ways the python packages are
>>> transferred from the driver to the spark cluster in client mode. Any
>>> info on that topic would be helpful.
>>>
>>> Thanks!
>>>
>>> On Wed, 11 Sep, 2019, 7:06 PM Patrick McCarthy wrote:
>>>
>>>> Are you running in cluster mode? A large virtualenv zip for the driver
>>>> sent into the cluster on a slow pipe could account for much of that
>>>> eight minutes.
>>>>
>>>> On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just ran the same script in a shell in the jupyter notebook and
>>>>> found the performance to be similar. So I can confirm this is
>>>>> happening because the python used by the jupyter notebook is
>>>>> different from the spark-submit python.
>>>>>
>>>>> But now I have a follow-up question. Are the dependent libraries of
>>>>> a python script also transferred to the worker machines when
>>>>> executing the script in spark? Because even though the driver python
>>>>> versions are different, the worker machines will use their own python
>>>>> environment to run the code. If anyone can explain this part, it
>>>>> would be helpful.
>>>>>
>>>>> Regards,
>>>>> Dhrubajyoti Hati.
>>>>> Mob No: 9886428028/9652029028
>>>>>
>>>>> On Wed, Sep 11, 2019 at 9:45 AM Dhrubajyoti Hati
>>>>> <dhruba.work@gmail.com> wrote:
>>>>>
>>>>>> Just checked: from where the script is submitted, i.e. wrt the
>>>>>> driver, the python envs are different. The jupyter one runs within a
>>>>>> virtual environment which is Python 2.7.5, and the spark-submit one
>>>>>> uses 2.6.6. But the executors have the same python version, right?
>>>>>> I tried doing a spark-submit from the jupyter shell; it fails to find
>>>>>> python 2.7, which is not there, and hence throws an error.
>>>>>>
>>>>>> Here is the udf which might take time:
>>>>>>
>>>>>> import base64
>>>>>> import zlib
>>>>>>
>>>>>> def decompress(data):
>>>>>>     bytecode = base64.b64decode(data)
>>>>>>     d = zlib.decompressobj(32 + zlib.MAX_WBITS)
>>>>>>     decompressed_data = d.decompress(bytecode)
>>>>>>     return decompressed_data.decode('utf-8')
>>>>>>
>>>>>> Could this be because of the mismatch between the two python
>>>>>> environments on the driver side, even though the processing happens
>>>>>> on the executor side?
>>>>>>
>>>>>> Regards,
>>>>>> Dhrub
>>>>>>
>>>>>> On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari
>>>>>> <abdealikothari@gmail.com> wrote:
>>>>>>
>>>>>>> Maybe you can try running it in a python shell or
>>>>>>> jupyter-console/ipython instead of a spark-submit and check how
>>>>>>> much time it takes too.
>>>>>>>
>>>>>>> Compare the env variables to check that no additional env
>>>>>>> configuration is present in either environment.
>>>>>>>
>>>>>>> Also, is the python environment for both the exact same? I ask
>>>>>>> because it looks like you're using a UDF, and if the Jupyter python
>>>>>>> has (let's say) numpy compiled with blas, it would be faster than a
>>>>>>> numpy without it, etc. I.e. some library you use may be using pure
>>>>>>> python and another may be using a faster C extension...
>>>>>>>
>>>>>>> What python libraries are you using in the UDFs? If you don't use
>>>>>>> UDFs at all and use some very simple pure spark functions, does the
>>>>>>> time difference still exist?
>>>>>>>
>>>>>>> Also, are you using dynamic allocation or some similar spark config
>>>>>>> which could vary performance between runs because the same
>>>>>>> resources were not utilized on Jupyter / spark-submit?
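>>>>>>>
>>>>>>> One mechanical way to do that comparison (a minimal sketch,
>>>>>>> assuming a live SparkSession named spark in each environment):
>>>>>>> dump the effective conf in both and diff the two outputs.
>>>>>>>
>>>>>>> # Prints every conf the context actually resolved, sorted so the
>>>>>>> # two dumps can be diffed line by line.
>>>>>>> for key, value in sorted(spark.sparkContext.getConf().getAll()):
>>>>>>>     print("%s=%s" % (key, value))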
>>>>>>>
>>>>>>> On Wed, Sep 11, 2019, 08:43 Stephen Boesch wrote:
>>>>>>>
>>>>>>>> Sounds like you have done your homework to properly compare. I'm
>>>>>>>> guessing the answer to the following is yes, but in any case: are
>>>>>>>> they both running against the same spark cluster with the same
>>>>>>>> configuration parameters, especially executor memory and number of
>>>>>>>> workers?
>>>>>>>>
>>>>>>>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati
>>>>>>>> <dhruba.work@gmail.com>:
>>>>>>>>
>>>>>>>>> No, I checked for that, hence wrote "brand new" jupyter notebook.
>>>>>>>>> Also, the times taken by the two are 30 mins and ~3 hrs, as I am
>>>>>>>>> reading 500 gigs of compressed base64-encoded text data from a
>>>>>>>>> hive table and decompressing and decoding it in one of the udfs.
>>>>>>>>> Also, the time compared is from the Spark UI, not how long the job
>>>>>>>>> actually takes after submission. It's just the running time I am
>>>>>>>>> comparing/mentioning.
>>>>>>>>>
>>>>>>>>> As mentioned earlier, all the spark conf params match in the two
>>>>>>>>> scripts, and that's why I am puzzled about what is going on.
>>>>>>>>>
>>>>>>>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy
>>>>>>>>> <pmccarthy@dstillery.com> wrote:
>>>>>>>>>
>>>>>>>>>> It's not obvious from what you pasted, but perhaps the jupyter
>>>>>>>>>> notebook is already connected to a running spark context, while
>>>>>>>>>> spark-submit needs to get a new spot in the (YARN?) queue.
>>>>>>>>>>
>>>>>>>>>> I would check the cluster job IDs for both to ensure you're
>>>>>>>>>> getting new cluster tasks for each.
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati
>>>>>>>>>> <dhruba.work@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am facing a weird behaviour while running a python script.
>>>>>>>>>>> Here is what the code looks like, mostly:
>>>>>>>>>>>
>>>>>>>>>>> def fn1(ip):
>>>>>>>>>>>     some code...
>>>>>>>>>>>     ...
>>>>>>>>>>>
>>>>>>>>>>> def fn2(row):
>>>>>>>>>>>     ...
>>>>>>>>>>>     some operations
>>>>>>>>>>>     ...
>>>>>>>>>>>     return row1
>>>>>>>>>>>
>>>>>>>>>>> udf_fn1 = udf(fn1)
>>>>>>>>>>> # hive table is of size > 500 gigs with ~4500 partitions
>>>>>>>>>>> cdf = spark.read.table("xxxx")
>>>>>>>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>>>>>>>     .drop("colz") \
>>>>>>>>>>>     .withColumnRenamed("colz", "coly")
>>>>>>>>>>>
>>>>>>>>>>> edf = ddf \
>>>>>>>>>>>     .filter(ddf.colp == 'some_value') \
>>>>>>>>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>>>>>>>>     .toDF()
>>>>>>>>>>>
>>>>>>>>>>> # simple way to run the performance test on both platforms
>>>>>>>>>>> print edf.count()
>>>>>>>>>>>
>>>>>>>>>>> Now when I run the same code in a brand new jupyter notebook,
>>>>>>>>>>> it runs 6x faster than when I run this python script using
>>>>>>>>>>> spark-submit. The configurations are printed and compared from
>>>>>>>>>>> both the platforms, and they are exactly the same. I even tried
>>>>>>>>>>> to run this script in a single cell of a jupyter notebook and
>>>>>>>>>>> still got the same performance. I need to understand if I am
>>>>>>>>>>> missing something in the spark-submit which is causing the
>>>>>>>>>>> issue. I tried to minimise the script to reproduce the same
>>>>>>>>>>> behaviour without much code.
>>>>>>>>>>>
>>>>>>>>>>> Both are run in client mode on a yarn-based spark cluster. The
>>>>>>>>>>> machines from which both are executed are also the same, and
>>>>>>>>>>> both run as the same user.
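>>>>>>>>>>>
>>>>>>>>>>> A quick way to confirm each run really is its own fresh YARN
>>>>>>>>>>> application (a minimal sketch, assuming the live SparkSession
>>>>>>>>>>> is named spark):
>>>>>>>>>>>
>>>>>>>>>>> sc = spark.sparkContext
>>>>>>>>>>> print(sc.master)         # expected: "yarn" for both runs
>>>>>>>>>>> print(sc.applicationId)  # should be a distinct id per submission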
>>>>>>>>>>>
>>>>>>>>>>> What I found is that the median of the task quantile values for
>>>>>>>>>>> the run with jupyter was 1.3 mins, while for the run with
>>>>>>>>>>> spark-submit it was ~8.5 mins. I am not able to figure out why
>>>>>>>>>>> this is happening.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone faced this kind of issue before, or knows how to
>>>>>>>>>>> resolve it?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dhrub
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Patrick McCarthy
>>>>>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>>>>> Dstillery
>>>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>> --
>>>> Patrick McCarthy
>>>> Senior Data Scientist, Machine Learning Engineering
>>>> Dstillery
>>>> 470 Park Ave South, 17th Floor, NYC 10016