spark-user mailing list archives

From 张万新 <>
Subject Different watermark for different kafka partitions in Structured Streaming
Date Wed, 30 Aug 2017 14:38:16 GMT

I'm using Structured Streaming to process logs from Kafka, with a
watermark to handle late events. Currently the watermark is computed as (max
event time seen by the engine - late threshold), and the same watermark is
used for all partitions.

But in a production environment it frequently happens that different
partitions are consumed at different speeds. Consumption of some
partitions may lag behind, so the newest event time in those partitions
may be much earlier than in the others. In this case, using the same
watermark for all partitions may cause heavy data loss.
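
To make the problem concrete, here is a minimal sketch in plain Python. It is
a simplified model of the formula above, not Spark's actual implementation.
The 10-minute threshold, the timestamps, and the function names are all
made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical late threshold, chosen only for this example.
LATE_THRESHOLD = timedelta(minutes=10)

def global_watermark(max_event_times, threshold):
    """One watermark for all partitions: overall max event time minus threshold."""
    return max(max_event_times.values()) - threshold

def per_partition_watermarks(max_event_times, threshold):
    """One watermark per partition: that partition's max event time minus threshold."""
    return {p: t - threshold for p, t in max_event_times.items()}

# Hypothetical state: partition 1 is being consumed about an hour behind partition 0.
max_seen = {
    0: datetime(2017, 8, 30, 14, 0),
    1: datetime(2017, 8, 30, 13, 0),
}

# The next event read from the lagging partition 1.
next_event_p1 = datetime(2017, 8, 30, 13, 1)

wm = global_watermark(max_seen, LATE_THRESHOLD)
per = per_partition_watermarks(max_seen, LATE_THRESHOLD)

# With the shared watermark the event counts as too late and would be dropped.
dropped_globally = next_event_p1 < wm
# With a watermark tracked per partition the same event would still be accepted.
dropped_per_partition = next_event_p1 < per[1]
```

In this model the event from partition 1 is not late relative to its own
partition, but it falls behind the shared watermark because partition 0 has
advanced much further.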

So is there any way to use a different watermark for each Kafka
partition, or any plan to work on this?
