spark-user mailing list archives

From Kalin Stoyanov <>
Subject Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Date Wed, 15 Jan 2020 17:53:14 GMT
Hi all,

First of all, let me say that I am pretty new to Spark, so this could be
entirely my fault somehow...
I noticed this when I was running a job on an Amazon EMR cluster with Spark
2.4.4, and it finished more slowly than when I had run it locally (on Spark
2.4.1). I checked the event logs, and the one from the newer version
had more stages.
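In case anyone wants to reproduce the comparison, here is roughly how I counted the stages. This is just a sketch assuming the standard event log format (one JSON object per line); the file path is whatever you downloaded the log to.

```python
import json

def count_stages(path):
    """Count completed stages in a Spark event log.

    Event logs are newline-delimited JSON; each line is one listener
    event with an "Event" field naming its type.
    """
    count = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("Event") == "SparkListenerStageCompleted":
                count += 1
    return count
```

Running that over both logs is how I saw the difference in stage counts between the two versions.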
Then I decided to do a comparison in the same environment, so I created two
versions of the same cluster with the only difference being the EMR
release, and hence the Spark version(?): the first was emr-5.24.1 with
Spark 2.4.2, and the second emr-5.28.0 with Spark 2.4.4. Sure enough,
the same thing happened, with the newer version having more stages and
taking almost twice as long to finish.
So I am pretty much at a loss here. Could it be that it is not Spark
itself, but some difference introduced in the EMR releases? At the moment
I can't think of any other explanation besides it being a bug...

Here are the two event logs:
and my code is here:

I ran it like this on the clusters (after putting the script on S3):
spark-submit --deploy-mode cluster --py-files
--name sim100_dt100_spark242 s3://kgs-s3/scripts/ 100 100
--outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/

So yeah, I was considering submitting a bug report, but the contributing
guide says it's better to ask here first. Any ideas on what's going on?
Am I missing something?

