spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Different watermark for different kafka partitions in Structured Streaming
Date Thu, 31 Aug 2017 01:13:19 GMT
Why not set the watermark to be looser, one that works across all
partitions? The main usage of watermark is to drop state. If you loosen the
watermark threshold (e.g. from 1 hour to 10 hours), then you will keep more
state with older data, but you are guaranteed that you will not drop
important data.

On Wed, Aug 30, 2017 at 7:41 AM, KevinZwx <kevinzwx1992@gmail.com> wrote:

> Hi,
>
> I'm working with Structured Streaming to process logs from kafka and use
> watermark to handle late events. Currently the watermark is computed by
> (max
> event time seen by the engine - late threshold), and the same watermark is
> used for all partitions.
>
> But in production environment it happens frequently that different
> partition
> is consumed at different speed, the consumption of some partitions may be
> left behind, so the newest event time in these partitions may be much
> smaller than than the others'. In this case using the same watermark for
> all
> partitions may cause heavy data loss.
>
> So is there any way to achieve different watermark for different kafka
> partition or any plan to work on this?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Mime
View raw message