I did a repartition to 10000 (hardcoded) before the keyBy and it now finishes in 1.2 minutes.
The questions remain open, though, because I don't want to hardcode the parallelism.

El vie., 8 de feb. de 2019 a la(s) 12:50, Pedro Tuero (tueropedro@gmail.com) escribió:
128 is the default parallelism defined for the cluster.
The question now is why the keyBy operation uses the default parallelism instead of the number of partitions of the RDD created by the previous step (5580).
Any clues?

El jue., 7 de feb. de 2019 a la(s) 15:30, Pedro Tuero (tueropedro@gmail.com) escribió:
I am running a job on Spark (using AWS EMR), and some stages take much longer with Spark 2.4 than with Spark 2.3.1:

Spark 2.4:

Spark 2.3.1:

With Spark 2.4, the keyBy operation takes more than 10x as long as it did with Spark 2.3.1.
It seems to be related to the number of tasks / partitions.

- Isn't the number of tasks in a job supposed to match the number of partitions of the RDD left by the previous job? Did that change in version 2.4?
- Which tools or configuration settings could I try, to diagnose and fix this performance regression?