spark-user mailing list archives

From Gourav Sengupta <>
Subject Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Date Wed, 15 Jan 2020 22:15:46 GMT
Hi Xiao,

that is the right attitude, thanks a ton :)

Hi Kalin,
the latest EMR version should be available right out of the box; perhaps you
can raise a quick AWS ticket to find out whether its release is being
delayed in your region. The release notes do mention that it fixes
a few Spark compatibility issues. Also, getting started with the latest
version of Spark takes less than 10 seconds once you have downloaded and
unzipped the file from Apache Spark. Besides that, I am fairly sure that
starting a Spark session in EMR with the following statement is always going
to give the same performance and predictability. As Xiao mentions, it might be
better to first isolate the cause and replicate it before raising issues.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
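
One way to start isolating the cause is to compare the two event logs stage by stage, since the newer run reportedly had more stages. The following is only a minimal sketch using the Python standard library. It assumes the event logs are uncompressed, newline-delimited JSON, and that completed stages appear as `SparkListenerStageCompleted` events carrying a `Stage Info` object with `Submission Time` and `Completion Time` in milliseconds. The function names `summarize_event_log` and `compare_event_logs` are hypothetical helpers, not part of Spark.

```python
import json


def summarize_event_log(lines):
    """Count completed stages and sum their durations from event-log lines.

    Assumption: each non-empty line is one JSON event, as in an
    uncompressed Spark event log.
    """
    stages = 0
    total_ms = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerStageCompleted":
            continue
        info = event.get("Stage Info", {})
        stages += 1
        start = info.get("Submission Time")
        end = info.get("Completion Time")
        # Skip the duration if either timestamp is missing.
        if start is not None and end is not None:
            total_ms += end - start
    return {"stages": stages, "stage_time_ms": total_ms}


def compare_event_logs(old_lines, new_lines):
    """Summarize both logs and report how many extra stages the new one has."""
    old = summarize_event_log(old_lines)
    new = summarize_event_log(new_lines)
    return {
        "old": old,
        "new": new,
        "extra_stages": new["stages"] - old["stages"],
    }
```

To use it, open each downloaded log file and pass the file objects in, for example `compare_event_logs(open(old_path), open(new_path))`, where both paths are placeholders for your own files. Note that the summed stage time is not the job's wall-clock time, because stages can run concurrently; it is only a rough indicator of where the extra work shows up.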

Thanks and Regards,
Gourav Sengupta

On Wed, Jan 15, 2020 at 9:10 PM Kalin Stoyanov <> wrote:

> Hi all,
> @Enrico, I've added just the SQL query pages (+js dependencies etc.)  in
> the google drive -
> That is what you had in mind right? They are different indeed. (For some
> reason after I saved them off of the history server the graphs get drawn
> twice, but that shouldn't matter)
> @Gourav Thanks, but emr 5.28.1 is not appearing for me when creating a
> cluster, so I can't check that for now; also I am using just s3://
> @Xiao, Yes, I will try to run this locally as well, but installing new
> versions of Spark won't be very fast and easy for me, so I won't be doing
> it right away.
> Regards,
> Kalin
> On Wed, Jan 15, 2020 at 10:20 PM Xiao Li <> wrote:
>> If you can confirm that this is caused by Apache Spark, feel free to open
>> a JIRA. I would not expect your queries to hit such a major performance
>> regression in any release. Also, please try the 3.0 preview releases.
>> Thanks,
>> Xiao
>> On Wed, Jan 15, 2020 at 10:53 AM Kalin Stoyanov <> wrote:
>>> Hi Xiao,
>>> Thanks, I didn't know that. This
>>> implies that their fork is not used in emr 5.27. I tried that and it has
>>> the same issue. But then again in their article they were comparing emr
>>> 5.27 vs 5.16 so I can't be sure... Maybe I'll try getting the latest
>>> version of Spark locally and make the comparison that way.
>>> Regards,
>>> Kalin
>>> On Wed, Jan 15, 2020 at 7:58 PM Xiao Li <> wrote:
>>>> EMR has its own fork of Spark, called the EMR runtime. It is not
>>>> Apache Spark. You might need to talk with them instead of posting
>>>> questions in the Apache Spark community.
>>>> Cheers,
>>>> Xiao
>>>> On Wed, Jan 15, 2020 at 9:53 AM Kalin Stoyanov <> wrote:
>>>>> Hi all,
>>>>> First of all let me say that I am pretty new to Spark so this could be
>>>>> entirely my fault somehow...
>>>>> I noticed this when I was running a job on an Amazon EMR cluster with
>>>>> Spark 2.4.4, and it finished more slowly than when I had run it locally
>>>>> (on Spark 2.4.1). I checked the event logs, and the one from the newer
>>>>> version had more stages.
>>>>> Then I decided to do a comparison in the same environment, so I created
>>>>> two versions of the same cluster with the only difference being the
>>>>> release, and hence the Spark version(?) - the first one was emr-5.24.1
>>>>> with Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure
>>>>> enough, the same thing happened, with the newer version having more
>>>>> stages and taking almost twice as long to finish.
>>>>> So I am pretty much at a loss here - could it be that it is not
>>>>> because of spark itself, but because of some difference introduced in
>>>>> emr releases? At the moment I can't think of any other alternative besides
>>>>> it being a bug...
>>>>> Here are the two event logs:
>>>>> and my code is here:
>>>>> I ran it like so on the clusters (after putting it on s3):
>>>>> spark-submit --deploy-mode cluster --py-files
>>>>> s3://kgs-s3/scripts/,s3://kgs-s3/scripts/,s3://kgs-s3/scripts/
>>>>> --name sim100_dt100_spark242 s3://kgs-s3/scripts/ 100 100
>>>>> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>>>>> So yeah I was considering submitting a bug report, but in the guide it
>>>>> said it's better to ask here first, so any ideas on what's going on?
>>>>> Am I missing something?
>>>>> Regards,
>>>>> Kalin
