spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pedro Tuero <tuerope...@gmail.com>
Subject Spark 2.4 partitions and tasks
Date Thu, 07 Feb 2019 18:30:45 GMT
Hi,
I am running a job in spark (using aws emr) and some stages are taking a
lot more using spark  2.4 instead of Spark 2.3.1:

Spark 2.4:
[image: image.png]

Spark 2.3.1:
[image: image.png]

With Spark 2.4, the keyBy operation take more than 10X what it took with
Spark 2.3.1
It seems to be related to the number of tasks / partitions.

Questions:
- Is it not supposed that the number of task of a job is related to number
of parts of the RDD left by the previous job? Did that change in version
2.4??
- Which tools/ configuration may I try, to reduce this aberrant downgrade
of performance??

Thanks.
Pedro.

Mime
View raw message