spark-user mailing list archives

From András Kolbert <>
Subject Re: Tasks are skewed to one executor
Date Sun, 11 Apr 2021 09:16:48 GMT

The groupby is now by account and product. There is definitely some
skewness (as one expects, most users have interactions with small no of

What happens at the moment:

   1. Union
   2. Groupby (account, product)
   3. Calculate partition (new_user_item_agg =
   (F.abs(F.hash(col("account_name"))) % (5*(executor_count))))
   4. Repartition: new_user_item_agg =
   5. Write to HDFS:
   df.write \
           .partitionBy("partition_id") \
           .mode("overwrite") \
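
To illustrate why step 3 keeps the skew, here is a minimal plain-Python sketch of the same bucketing rule, hash of the account modulo 5 * executor_count. It is not the Spark job itself: zlib.crc32 is only a stand-in for F.hash, and the account names, row counts and executor_count=4 are made up for the example.

```python
import zlib
from collections import Counter


def partition_id(account_name: str, executor_count: int) -> int:
    """Mimic abs(hash(account_name)) % (5 * executor_count).

    zlib.crc32 is an assumed, deterministic stand-in for Spark's F.hash,
    so the bucket numbers will differ from the ones Spark produces.
    """
    return zlib.crc32(account_name.encode("utf-8")) % (5 * executor_count)


def rows_per_partition(rows, executor_count):
    """Count how many (account, product) rows land in each partition_id."""
    return Counter(partition_id(account, executor_count) for account, _ in rows)


# Hypothetical data: one heavy account plus 200 light ones.
rows = [("heavy_account", f"product_{i}") for i in range(10_000)]
rows += [(f"account_{i}", "product_0") for i in range(200)]

counts = rows_per_partition(rows, executor_count=4)
heavy_bucket = partition_id("heavy_account", 4)

# Every row of the heavy account shares one partition_id, so that
# partition is far larger than all the others.
print(heavy_bucket, counts[heavy_bucket], max(counts.values()))
```

Since the partition id depends only on the account, all rows of a heavy account end up in the same partition, no matter how many buckets there are.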

Maybe I could add the partition_id to the groupby already and that would
give me some performance increase?
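
As a quick sanity check on that idea, the plain-Python sketch below groups some made-up rows with and without the partition id in the key. It is only an illustration: crc32 again stands in for F.hash, the rows and executor_count=4 are hypothetical, and it says nothing about how Spark would plan the shuffle.

```python
import zlib
from collections import defaultdict


def partition_id(account_name, executor_count):
    # Assumed stand-in for abs(hash(account_name)) % (5 * executor_count).
    return zlib.crc32(account_name.encode("utf-8")) % (5 * executor_count)


def group_counts(rows, with_partition_id, executor_count=4):
    """Sum interaction counts per group, optionally keyed by partition_id too."""
    totals = defaultdict(int)
    for account, product, n in rows:
        key = (account, product)
        if with_partition_id:
            key += (partition_id(account, executor_count),)
        totals[key] += n
    return totals


# Hypothetical (account, product, interaction_count) rows.
rows = [("a1", "p1", 3), ("a1", "p1", 2), ("a1", "p2", 1), ("a2", "p1", 5)]

plain = group_counts(rows, with_partition_id=False)
keyed = group_counts(rows, with_partition_id=True)

# Same number of groups and same totals either way.
print(len(plain), len(keyed))
```

Because the partition id is a function of the account alone, adding it to the groupby does not change the groups. Any gain would have to come from Spark avoiding a later shuffle, which this sketch does not measure.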

The biggest issue is this:
[image: image.png]

Whenever you see some spikes, that's because executors died during the

[image: image.png]

This is the status after:
[image: image.png]

The ideal situation would be to optimise the flow and not let the executors
die constantly.
