Hi,
I am running a job on Spark (on AWS EMR), and some stages are taking much
longer with Spark 2.4 than with Spark 2.3.1:
Spark 2.4:
[image: image.png]
Spark 2.3.1:
[image: image.png]
With Spark 2.4, the keyBy operation takes more than 10X as long as it did
with Spark 2.3.1.
It seems to be related to the number of tasks / partitions.
Questions:
- Isn't the number of tasks in a job supposed to be determined by the number
of partitions of the RDD produced by the previous job? Did that change in
version 2.4? (A rough sketch of the kind of check I mean is below.)
- Which tools or configuration options could I try in order to fix this
severe performance regression?
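For reference, a minimal sketch of how I'm comparing partition counts around
the keyBy step (the S3 path, the key function, and the partition count of 200
are just placeholders, not my real job):

    import org.apache.spark.sql.SparkSession

    object PartitionCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partition-check")
          // Illustrative value; the real setting depends on the cluster size.
          .config("spark.default.parallelism", "200")
          .getOrCreate()
        val sc = spark.sparkContext

        // Placeholder input path.
        val records = sc.textFile("s3://my-bucket/input/")

        // keyBy is a narrow transformation, so its stage should run one task
        // per partition of the parent RDD.
        println(s"partitions before keyBy: ${records.getNumPartitions}")

        val keyed = records.keyBy(line => line.split(",")(0))
        println(s"partitions after keyBy: ${keyed.getNumPartitions}")

        // Explicitly pinning the partition count before keyBy, to compare
        // the behaviour between 2.3.1 and 2.4.
        val pinned = records.repartition(200).keyBy(line => line.split(",")(0))
        println(s"partitions after repartition + keyBy: ${pinned.getNumPartitions}")

        spark.stop()
      }
    }

If the printed counts differ between the two versions, that would point at the
partitioning rather than at keyBy itself.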
Thanks.
Pedro.