spark-user mailing list archives

From Pedro Tuero <tuerope...@gmail.com>
Subject Re: Spark 2.4 partitions and tasks
Date Fri, 08 Feb 2019 15:50:43 GMT
128 is the default parallelism defined for the cluster.
The question now is why the keyBy operation uses the default parallelism
instead of the number of partitions of the RDD created by the previous step
(5580).
Any clues?
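
For what it's worth, a minimal spark-shell style sketch of what I would check;
the data, key function, and partition counts below are invented for
illustration, only the partitioning behaviour is the point:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("keyby-partitions-check")
  .config("spark.default.parallelism", "128")  // the cluster default mentioned above
  .getOrCreate()
val sc = spark.sparkContext

// Stand-in for the RDD left by the previous step (5580 partitions).
val records = sc.parallelize(1 to 1000000, numSlices = 5580)

// keyBy is a narrow transformation, so it should keep the parent's partition count.
val keyed = records.keyBy(x => x % 1000)
println(keyed.getNumPartitions)              // expected: 5580

// A shuffle with no explicit partitioner falls back to spark.default.parallelism
// when none of the parent RDDs has a partitioner.
val summed = keyed.reduceByKey(_ + _)
println(summed.getNumPartitions)             // 128 here

// Pinning the partition count on the shuffle sidesteps the default.
val pinned = keyed.reduceByKey(new HashPartitioner(5580), _ + _)
println(pinned.getNumPartitions)             // 5580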

On Thu, Feb 7, 2019 at 3:30 PM Pedro Tuero (tueropedro@gmail.com)
wrote:

> Hi,
> I am running a job in Spark (on AWS EMR), and some stages are taking
> much longer with Spark 2.4 than with Spark 2.3.1:
>
> Spark 2.4:
> [image: image.png]
>
> Spark 2.3.1:
> [image: image.png]
>
> With Spark 2.4, the keyBy operation takes more than 10x as long as it did
> with Spark 2.3.1.
> It seems to be related to the number of tasks / partitions.
>
> Questions:
> - Isn't the number of tasks in a job supposed to match the number of
> partitions of the RDD left by the previous job? Did that change in version
> 2.4?
> - Which tools or configuration should I try to mitigate this severe
> performance regression?
>
> Thanks.
> Pedro.
>
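
Regarding the tools/configuration question quoted above, a hedged sketch of
the knobs I would look at first; the values and the S3 path are placeholders,
not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-tuning-sketch")
  // Used as the shuffle partition count when an RDD operation gets neither a
  // partitioner nor an explicit numPartitions.
  .set("spark.default.parallelism", "5580")
val sc = new SparkContext(conf)

// Or repartition explicitly around the expensive stage instead of relying on
// the default (the input path is a placeholder):
val data = sc.textFile("s3://example-bucket/example-input")
val repartitioned = data.repartition(5580)
println(repartitioned.getNumPartitions)      // 5580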
