spark-user mailing list archives

From 张万新 <kevinzwx1...@gmail.com>
Subject Re: Different watermark for different kafka partitions in Structured Streaming
Date Fri, 01 Sep 2017 08:59:44 GMT
Thanks. It's true that a looser watermark guarantees that less data is
dropped, but at the same time more state needs to be kept. I was just
wondering whether something like Flink's Kafka-partition-aware watermark
might be a better solution in Structured Streaming (SS).
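
To make the difference concrete, here is a minimal plain-Python sketch (not
Spark or Flink code). It assumes event times in minutes, a 60-minute late
threshold, and a watermark that is updated once per micro-batch. The
partition-aware variant takes the minimum of the per-partition maximum event
times, an assumption modelled on how Flink merges per-partition watermarks.
The names `run` and `BATCHES` and the sample data are hypothetical.

```python
# Hypothetical sample data: each micro-batch is a list of (partition, event_time).
# Partition 0 is consumed quickly; partition 1 lags about two hours behind.
BATCHES = [
    [(0, 600), (1, 480)],
    [(0, 660), (1, 500)],
    [(0, 720), (1, 520)],
]


def run(batches, threshold, partition_aware):
    """Return (kept, dropped) events under one of two watermark policies.

    partition_aware=False: watermark = max event time over all partitions - threshold
    partition_aware=True:  watermark = min over partitions of that partition's
                           max event time - threshold (assumed Flink-like merge)
    """
    max_seen = {}                 # partition -> max event time observed so far
    watermark = float("-inf")     # nothing is late before the first batch
    kept, dropped = [], []
    for batch in batches:
        # Events older than the current watermark are treated as too late.
        for partition, event_time in batch:
            if event_time < watermark:
                dropped.append((partition, event_time))
            else:
                kept.append((partition, event_time))
        # Update per-partition maxima, then advance the watermark for the next batch.
        for partition, event_time in batch:
            max_seen[partition] = max(max_seen.get(partition, event_time), event_time)
        if partition_aware:
            candidate = min(max_seen.values()) - threshold
        else:
            candidate = max(max_seen.values()) - threshold
        watermark = max(watermark, candidate)   # a watermark never moves backwards
    return kept, dropped


if __name__ == "__main__":
    print("global dropped:         ", run(BATCHES, 60, partition_aware=False)[1])
    print("partition-aware dropped:", run(BATCHES, 60, partition_aware=True)[1])
```

With this data the single global watermark drops the later events of the
lagging partition, while the partition-aware variant keeps them, at the cost
of holding the watermark back to the slowest partition.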

On Thu, Aug 31, 2017 at 9:13 AM, Tathagata Das <tathagata.das1565@gmail.com> wrote:

> Why not set a looser watermark, one that works across all partitions?
> The main use of the watermark is to drop state. If you loosen the
> watermark threshold (e.g. from 1 hour to 10 hours), then you will keep more
> state with older data, but you are guaranteed that you will not drop
> important data.
>
> On Wed, Aug 30, 2017 at 7:41 AM, KevinZwx <kevinzwx1992@gmail.com> wrote:
>
>> Hi,
>>
>> I'm working with Structured Streaming to process logs from Kafka and use
>> a watermark to handle late events. Currently the watermark is computed as
>> (max event time seen by the engine - late threshold), and the same
>> watermark is used for all partitions.
>>
>> But in a production environment it frequently happens that different
>> partitions are consumed at different speeds. Consumption of some
>> partitions may lag behind, so the newest event time in those partitions
>> may be much smaller than in the others. In this case, using the same
>> watermark for all partitions may cause heavy data loss.
>>
>> So is there any way to use a different watermark for each Kafka
>> partition, or is there any plan to work on this?
>
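
The trade-off discussed in this thread (a looser threshold drops less data but
keeps more state) can be illustrated with a small plain-Python simulation. It
is only a sketch and not Spark code: in Spark the threshold is the second
argument of `withWatermark(eventTime, delayThreshold)`, whereas here the
function `simulate`, the list `EVENTS`, the 60-minute tumbling windows, and
the per-event watermark update are all simplifying assumptions.

```python
# Hypothetical event times in minutes, listed in arrival order.
# The values 30 and 90 arrive long after newer events have been seen.
EVENTS = [0, 70, 130, 200, 30, 260, 320, 90, 400]


def simulate(event_times, threshold, window=60):
    """Return (dropped_events, windows_still_in_state) after all events arrive."""
    watermark = float("-inf")
    max_seen = float("-inf")
    state = {}        # window start -> event count (stands in for aggregation state)
    dropped = 0
    for t in event_times:
        if t < watermark:
            dropped += 1                      # too late: ignored by the aggregation
        else:
            start = (t // window) * window    # tumbling window containing t
            state[start] = state.get(start, 0) + 1
        # watermark = max event time seen - late threshold
        max_seen = max(max_seen, t)
        watermark = max_seen - threshold
        # State for windows that end at or before the watermark can be dropped.
        for start in [s for s in state if s + window <= watermark]:
            del state[start]
    return dropped, len(state)


if __name__ == "__main__":
    for threshold in (60, 600):               # 1 hour vs 10 hours
        dropped, open_windows = simulate(EVENTS, threshold)
        print(f"threshold={threshold}: dropped={dropped}, windows in state={open_windows}")
```

With this sample data, the 1-hour threshold drops the two late events and ends
with two windows in state; the 10-hour threshold drops nothing but keeps all
seven windows.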
