I am running a job on Spark (using AWS EMR), and some stages take much longer on Spark 2.4 than they did on Spark 2.3.1:
With Spark 2.4, the `keyBy` operation takes more than 10x as long as it did with Spark 2.3.1.
It seems to be related to the number of tasks/partitions.
- Isn't the number of tasks in a stage supposed to match the number of partitions of the RDD produced by the previous stage? Did that change in version 2.4?
- Which tools or configuration settings can I try in order to diagnose and fix this performance regression?
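For context, this is roughly what I have tried so far: pinning the parallelism settings explicitly at submit time, to rule out a changed partitioning default between the two versions. This is only a sketch; `my_job.py` and the value `200` are placeholders for my actual job and cluster size:

```shell
# Hypothetical spark-submit invocation: pin the default parallelism for RDD
# operations (affects the partition count after shuffles feeding keyBy) and
# the SQL shuffle partition count, so both Spark versions use the same values.
spark-submit \
  --conf spark.default.parallelism=200 \
  --conf spark.sql.shuffle.partitions=200 \
  my_job.py
```

I can also confirm the partition counts at runtime with `rdd.getNumPartitions()` before and after the slow stage, but I don't understand why the defaults would differ between versions.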