spark-user mailing list archives

From András Kolbert <kolbertand...@gmail.com>
Subject Re: Tasks are skewed to one executor
Date Sun, 11 Apr 2021 09:16:48 GMT
Hi,

The groupby is now by account and product. There is definitely some
skewness (as one expects, most users interact with a small number of
items).

What happens at the moment:

   1. Union
   2. Groupby (account, product)
   3. Calculate partition:
      new_user_item_agg = new_user_item_agg.withColumn(
          "partition_id",
          F.abs(F.hash(col("account_name"))) % (5 * executor_count))
   4. Repartition:
      new_user_item_agg = new_user_item_agg.repartition("partition_id")
   5. Write to HDFS:
      df.write \
          .partitionBy("partition_id") \
          .mode("overwrite") \
          .save(hdfs_path)
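
To illustrate how the partition_id from step 3 behaves when a few accounts
dominate, here is a minimal pure-Python sketch. It is not the actual job:
zlib.crc32 stands in for Spark's F.hash (which uses a different hash
function, so the real bucket numbers will differ), and the account names,
row counts and executor_count=4 are made-up values for illustration only.

```python
import zlib
from collections import Counter


def partition_id(account_name: str, executor_count: int) -> int:
    """Mimic abs(hash(account_name)) % (5 * executor_count).

    zlib.crc32 is only a stand-in for Spark's hash; it is already
    non-negative in Python 3, so no abs() is needed here.
    """
    return zlib.crc32(account_name.encode("utf-8")) % (5 * executor_count)


def rows_per_partition(rows_per_account: dict, executor_count: int) -> Counter:
    """Sum the row counts of all accounts that land in each partition_id."""
    counts = Counter()
    for account, n_rows in rows_per_account.items():
        counts[partition_id(account, executor_count)] += n_rows
    return counts


# Hypothetical data: 1000 light accounts and one very heavy account.
rows_per_account = {f"acct_{i}": 10 for i in range(1000)}
rows_per_account["acct_heavy"] = 50_000

sizes = rows_per_partition(rows_per_account, executor_count=4)
largest_bucket, largest_size = sizes.most_common(1)[0]

print(f"buckets used: {len(sizes)}")
print(f"largest bucket: {largest_bucket} with {largest_size} rows")
print(f"total rows: {sum(sizes.values())}")
```

Because partition_id depends only on account_name, every row of a heavy
account maps to the same bucket, so one bucket can end up far larger than
the others no matter how many buckets there are.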

Maybe I could add the partition_id to the groupby already and that would
give me some performance increase?

The biggest issue is this:
[image: image.png]

Whenever you see some spikes, that's because executors died during the
processing.

[image: image.png]

This is the status after:
[image: image.png]


The ideal situation would be to optimise the flow and not let the executors
die constantly.
