spark-user mailing list archives

From Gourav Sengupta <>
Subject Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Date Wed, 15 Jan 2020 19:15:01 GMT

I am pretty sure that AWS released EMR 5.28.1 with some bug fixes the day
before yesterday.

Also, please ensure that you are using s3:// rather than s3a:// or any
other scheme.
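If existing paths use the s3a:// or s3n:// schemes, they can be rewritten before being passed to Spark. A minimal sketch of such a rewrite; the helper name `to_emrfs_uri` is purely illustrative, not part of any EMR or Spark API:

```python
def to_emrfs_uri(uri: str) -> str:
    """Rewrite s3a:// or s3n:// URIs to the s3:// scheme
    (the scheme EMR expects for its S3 connector)."""
    for prefix in ("s3a://", "s3n://"):
        if uri.startswith(prefix):
            return "s3://" + uri[len(prefix):]
    return uri  # already s3:// or a non-S3 path; leave unchanged
```

For example, `to_emrfs_uri("s3a://kgs-s3/input/")` returns `"s3://kgs-s3/input/"`, while paths that already use s3:// pass through untouched.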

On another note, Xiao is not entirely right in suggesting that EMR issues
should not be posted here. A large group of users run Spark on Databricks,
GCP, Azure, native installations, and of course on EMR and Glue. I have
always found that the Apache Spark community takes care of each other and
answers questions for the widest possible user base, just as I did now. I
think that only Matei Zaharia could make such a sweeping call on what this
entire community is about.

Thanks and Regards,
Gourav Sengupta

On Wed, Jan 15, 2020 at 5:53 PM Kalin Stoyanov <> wrote:

> Hi all,
> First of all, let me say that I am pretty new to Spark, so this could be
> entirely my fault somehow...
> I noticed this when I was running a job on an Amazon EMR cluster with
> Spark 2.4.4, and it finished slower than when I had run it locally (on
> Spark 2.4.1). I checked the event logs, and the one from the newer
> version had more stages.
> Then I decided to do a comparison in the same environment, so I created
> two versions of the same cluster with the only difference being the EMR
> release, and hence the Spark version(?) - the first was emr-5.24.1 with
> Spark 2.4.2, and the second emr-5.28.0 with Spark 2.4.4. Sure enough,
> the same thing happened, with the newer version having more stages and
> taking almost twice as long to finish.
> So I am pretty much at a loss here - could it be that it is not Spark
> itself, but some difference introduced in the EMR releases? At the
> moment I can't think of any alternative besides it being a bug...
> Here are the two event logs:
> and my code is here:
> I ran it like so on the clusters (after putting it on s3):
> spark-submit --deploy-mode cluster --py-files
> s3://kgs-s3/scripts/,s3://kgs-s3/scripts/,s3://kgs-s3/scripts/
> --name sim100_dt100_spark242 s3://kgs-s3/scripts/ 100 100
> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
> So yeah, I was considering submitting a bug report, but the guide said
> it's better to ask here first - so, any ideas on what's going on? Maybe
> I am missing something?
> Regards,
> Kalin
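The stage-count difference Kalin describes can be checked directly from the two event logs. A minimal sketch, assuming the standard event-log format Spark writes (newline-delimited JSON, one event per line, with completed stages recorded as `SparkListenerStageCompleted` events):

```python
import json

def count_stages(event_log_path: str) -> int:
    """Count completed stages recorded in a Spark event log
    (newline-delimited JSON, one event object per line)."""
    stages = 0
    with open(event_log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("Event") == "SparkListenerStageCompleted":
                stages += 1
    return stages
```

Running this over the 2.4.2 and 2.4.4 logs and comparing the counts would confirm whether the extra stages are really there, before deciding whether to file a bug.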
