spark-user mailing list archives

From Xiao Li <>
Subject Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Date Wed, 15 Jan 2020 20:20:27 GMT
If you can confirm that this is caused by Apache Spark, feel free to open a
JIRA. I would not expect your queries to hit such a major performance
regression between releases. Also, please try the 3.0 preview releases.



On Wed, Jan 15, 2020 at 10:53 AM Kalin Stoyanov <> wrote:

> Hi Xiao,
> Thanks, I didn't know that. This
> implies that their fork is not used in emr 5.27. I tried that and it has
> the same issue. But then again in their article they were comparing emr
> 5.27 vs 5.16 so I can't be sure... Maybe I'll try getting the latest
> version of Spark locally and make the comparison that way.
> Regards,
> Kalin
> On Wed, Jan 15, 2020 at 7:58 PM Xiao Li <> wrote:
>> EMR has its own fork of Spark, called the EMR runtime. It is not
>> Apache Spark. You might need to talk with them instead of posting questions
>> in the Apache Spark community.
>> Cheers,
>> Xiao
>> On Wed, Jan 15, 2020 at 9:53 AM Kalin Stoyanov <> wrote:
>>> Hi all,
>>> First of all let me say that I am pretty new to Spark so this could be
>>> entirely my fault somehow...
>>> I noticed this when I was running a job on an Amazon EMR cluster with
>>> Spark 2.4.4, and it finished more slowly than when I had run it locally (on
>>> Spark 2.4.1). I checked the event logs, and the one from the newer
>>> version had more stages.
>>> Then I decided to do a comparison in the same environment, so I created
>>> two versions of the same cluster, with the only difference being the EMR
>>> release, and hence the Spark version(?): the first one was emr-5.24.1 with
>>> Spark 2.4.2, and the second one was emr-5.28.0 with Spark 2.4.4. Sure enough,
>>> the same thing happened, with the newer version having more stages and
>>> taking almost twice as long to finish.
>>> So I am pretty much at a loss here - could it be that it is not because
>>> of spark itself, but because of some difference introduced in the emr
>>> releases? At the moment I can't think of any other alternative besides it
>>> being a bug...
>>> Here are the two event logs:
>>> and my code is here:
>>> I ran it like so on the clusters (after putting it on s3):
>>> spark-submit --deploy-mode cluster --py-files
>>> s3://kgs-s3/scripts/,s3://kgs-s3/scripts/,s3://kgs-s3/scripts/
>>> --name sim100_dt100_spark242 s3://kgs-s3/scripts/ 100 100
>>> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>>> So yeah I was considering submitting a bug report, but in the guide it
>>> said it's better to ask here first, so any ideas on what's going on? Maybe
>>> I am missing something?
>>> Regards,
>>> Kalin
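
The observation in the original message, that the newer version's event log
contains more stages, can be checked with a short script. The following is a
minimal sketch, not the poster's method. It assumes uncompressed Spark event
logs in the usual JSON-lines form, where each completed stage appears as a
"SparkListenerStageCompleted" event with a "Stage Info" object. The function
names and the file names in the usage note are hypothetical.

```python
import json
from collections import Counter


def stage_summary(lines):
    """Return (stage_id, stage_name, duration_ms) for each completed stage.

    `lines` is any iterable of event-log lines (one JSON object per line).
    """
    stages = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerStageCompleted":
            continue
        info = event["Stage Info"]
        start = info.get("Submission Time")
        end = info.get("Completion Time")
        # Either timestamp may be missing, so the duration is optional.
        duration_ms = end - start if start is not None and end is not None else None
        stages.append((info["Stage ID"], info.get("Stage Name", ""), duration_ms))
    return stages


def compare(old_lines, new_lines):
    """Compare stage counts and list stage names that occur more often in the new log."""
    old = stage_summary(old_lines)
    new = stage_summary(new_lines)
    old_names = Counter(name for _, name, _ in old)
    new_names = Counter(name for _, name, _ in new)
    return {
        "old_stage_count": len(old),
        "new_stage_count": len(new),
        # Counter subtraction keeps only names with a positive surplus.
        "extra_in_new": dict(new_names - old_names),
    }
```

With two logs downloaded locally (file names here are placeholders), a call
such as compare(open("eventlog_spark242"), open("eventlog_spark244")) returns
the two stage counts and the names of the stages that appear only, or more
often, in the newer run. That gives a starting point for seeing where the
extra stages come from.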
