kafka-users mailing list archives

From Eno Thereska <eno.there...@gmail.com>
Subject Re: Kafka Streams: consume 6 months old data VS windows maintain durations
Date Fri, 13 Jan 2017 11:38:43 GMT
That should work.

Thanks
Eno
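
A minimal sketch of the plan quoted below, for illustration only. It makes the window maintain duration switchable between the reprocessing run and normal operation. The class name, the `app.reprocessing` system property, and the 183-day approximation of "6 months" are assumptions, not part of the original topology. The Kafka Streams call appears only in a comment so that the block runs without the library on the classpath.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: chooses the window maintain duration for a run.
class WindowRetention {
    static final long ONE_HOUR_MILLIS = TimeUnit.HOURS.toMillis(1);
    static final long THREE_DAYS_MILLIS = TimeUnit.DAYS.toMillis(3);
    // Assumption: "6 months" is approximated as 183 days.
    static final long SIX_MONTHS_MILLIS = TimeUnit.DAYS.toMillis(183);

    // Long retention while re-consuming old data, the original one otherwise.
    static long retentionMillis(boolean reprocessing) {
        return reprocessing ? SIX_MONTHS_MILLIS : THREE_DAYS_MILLIS;
    }

    public static void main(String[] args) {
        // Hypothetical flag, e.g. -Dapp.reprocessing=true for the run after the reset.
        boolean reprocessing =
            Boolean.parseBoolean(System.getProperty("app.reprocessing", "false"));
        long retention = retentionMillis(reprocessing);

        // In the topology, the value would replace THREE_DAYS_MILLIS:
        //   TimeWindows.of(ONE_HOUR_MILLIS).until(retention)
        System.out.println("window size (ms): " + ONE_HOUR_MILLIS);
        System.out.println("maintain duration (ms): " + retention);
    }
}
```

With such a switch, the first run after the reset would be started with the flag set, and the later restart (without a reset) would omit it to fall back to the 3-day duration.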
> On 12 Jan 2017, at 21:21, Nicolas Fouché <nfouche@onfocus.io> wrote:
> 
> Thanks, Eno!
> 
> My intention is to reprocess all the data from the beginning. And we'll
> reset the application as documented in the Confluent blog.
> We don't want to keep the previous results; in fact, we want to overwrite
> them. Kafka Connect will happily replace all records in our sink database.
> 
> So I'll reset the streams app, then change the window duration times to 6
> months until the application processes fresh messages, and then I'll
> restart the application with the original window duration time (without a
> reset this time). Let's hope Kafka Streams will detect this window duration
> change and drop old windows immediately?
> 
> 
> 2017-01-12 17:06 GMT+01:00 Eno Thereska <eno.thereska@gmail.com>:
> 
>> Hi Nicolas,
>> 
>> I've seen your previous message thread too. I think your best bet for now
>> is to increase the window duration time, to 6 months.
>> 
>> If you change your application logic, e.g., by changing the duration time,
>> the semantics of the change wouldn't immediately be clear and it's worth
>> clarifying those. For example, would the intention be to reprocess all the
>> data from the beginning? Or start where you left off (in which case the
>> fact that the original processing went over data that is 6 months old would
>> not be relevant, since you'd start from where you left off the second
>> time)? Right now we support a limited way to reprocess the data by
>> effectively resetting a streams application
>> (https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/).
>> I wouldn't recommend using that if you want to keep the results of the
>> previous run though.
>> 
>> Eno
>> 
>>> On 12 Jan 2017, at 09:15, Nicolas Fouché <nfouche@onfocus.io> wrote:
>>> 
>>> Hi.
>>> 
>>> 
>>> I'd like to re-consume 6 months old data with Kafka Streams.
>>> 
>>> My current topology can't because it defines aggregations with windows
>>> maintain durations of 3 days.
>>> TimeWindows.of(ONE_HOUR_MILLIS).until(THREE_DAYS_MILLIS)
>>> 
>>> 
>>> 
>>> As discovered (and shared [1]) a few months ago, consuming a record
>>> older than 3 days will mess up my aggregates. How do you deal with this?
>>> Do you temporarily raise the windows maintain durations until all records
>>> are consumed? Do you always run your topologies with long durations, like
>>> a year? I have no idea what the impact on RAM and disk would be, but I
>>> guess RocksDB would cry a little.
>>> 
>>> 
>>> Final question: if I raise the duration to 6 months, consume my records,
>>> and then set the duration back to 3 days, would the old aggregates be
>>> automatically destroyed?
>>> 
>>> 
>>> [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201610.mbox/%3cCABQKjkJ42N7z4BxJDKrDYZ_kmpunH738uxvm7gy24dnkx+RvVw@mail.gmail.com%3e
>>> 
>>> 
>>> Thanks
>>> Nicolas
>>> 
>>> 
>> 
>> 

