OK, that is what I guessed was happening. You have a few options to try in your SIT environment. First, try adding that partition_id to the groupBy composite key (account, product, partition_id), which effectively adds a salt. How many rows are being delivered in your streaming data per batch interval? HTH
On Sun, 11 Apr 2021 at 10:17, András Kolbert <email@example.com> wrote:

Hi,
The groupBy is by account and product now. There is definitely some skewness (as one expects, most users have interactions with a small number of items).
What happens at the moment:
- Groupby (account, product)
- Calculate partition: new_user_item_agg = new_user_item_agg.withColumn("partition_id", F.abs(F.hash(col("account_name"))) % (5 * executor_count))
- Repartition: new_user_item_agg = new_user_item_agg.repartition("partition_id")
- Write to HDFS:
.save(hdfs_path)

Maybe I could add the partition_id to the groupby already, and that would give me some performance increase?

The biggest issue is this:
Whenever you see some spikes, that's because executors died during the processing.
This is the status after:

The ideal situation would be to optimise the flow and not let the executors die constantly.