kafka-users mailing list archives

From Nicolas Fouché <nfou...@onfocus.io>
Subject Re: Kafka Streams: consume 6 months old data VS windows maintain durations
Date Thu, 12 Jan 2017 21:21:07 GMT
Thanks Eno !

My intention is to reprocess all the data from the beginning. And we'll
reset the application as documented in the Confluent blog.
We don't want to keep the previous results; in fact, we want to overwrite
them. Kafka Connect will happily replace all records in our sink database.

So I'll reset the streams app, then change the window duration times to 6
months until the application processes fresh messages, and then I'll
restart the application with the original window duration time (without a
reset this time). Let's hope Kafka Streams will detect this window duration
change and drop old windows immediately ?
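
To make the plan concrete, here is a toy model of the retention rule as I
understand it. It is plain Java, not Kafka Streams code. The assumptions are
mine: a window is kept only while its start is within the maintain duration
of the newest timestamp seen so far, and "6 months" is approximated as 183
days. The class and method names are hypothetical. The constants are longs
because 6 months in milliseconds does not fit in an int.

```java
import java.util.concurrent.TimeUnit;

class WindowRetentionSketch {
    static final long ONE_HOUR_MILLIS = TimeUnit.HOURS.toMillis(1);
    static final long THREE_DAYS_MILLIS = TimeUnit.DAYS.toMillis(3);
    // Assumption: "6 months" approximated as 183 days.
    static final long SIX_MONTHS_MILLIS = TimeUnit.DAYS.toMillis(183);

    // Start of the tumbling window that contains the given timestamp.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // Toy rule (an assumption, not the real store logic): the window is kept
    // while its start is no older than the newest seen timestamp minus the
    // maintain duration.
    static boolean isRetained(long recordTimestampMs, long newestTimestampMs,
                              long windowSizeMs, long retentionMs) {
        long start = windowStart(recordTimestampMs, windowSizeMs);
        return start >= newestTimestampMs - retentionMs;
    }

    public static void main(String[] args) {
        long newest = System.currentTimeMillis();
        long oldRecord = newest - TimeUnit.DAYS.toMillis(150);

        System.out.println("3 days retention keeps old record: "
                + isRetained(oldRecord, newest, ONE_HOUR_MILLIS, THREE_DAYS_MILLIS));
        System.out.println("6 months retention keeps old record: "
                + isRetained(oldRecord, newest, ONE_HOUR_MILLIS, SIX_MONTHS_MILLIS));
    }
}
```

Under this model a record that is 150 days old falls outside a 3 day maintain
duration but inside a 6 month one, which is why I would raise the value in
until(...) for the replay and lower it again afterwards.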


2017-01-12 17:06 GMT+01:00 Eno Thereska <eno.thereska@gmail.com>:

> Hi Nicolas,
>
> I've seen your previous message thread too. I think your best bet for now
> is to increase the window duration time, to 6 months.
>
> If you change your application logic, e.g., by changing the duration time,
> the semantics of the change wouldn't immediately be clear and it's worth
> clarifying those. For example, would the intention be to reprocess all the
> data from the beginning? Or start where you left off (in which case the
> fact that the original processing went over data that is 6 months old would
> not be relevant, since you'd start from where you left off the second
> time)? Right now we support a limited way to reprocess the data by
> effectively resetting a streams application
> (https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/).
> I wouldn't recommend using that if you want to keep the results of the
> previous run though.
>
> Eno
>
> > On 12 Jan 2017, at 09:15, Nicolas Fouché <nfouche@onfocus.io> wrote:
> >
> > Hi.
> >
> >
> > I'd like to re-consume 6 months old data with Kafka Streams.
> >
> > My current topology can't because it defines aggregations with windows
> > maintain durations of 3 days.
> > TimeWindows.of(ONE_HOUR_MILLIS).until(THREE_DAYS_MILLIS)
> >
> >
> >
> > As discovered (and shared [1]) a few months ago, consuming a record
> > older than 3 days will mess up my aggregates. How do you deal with this ?
> > Do you temporarily raise the windows maintain durations until all records
> > are consumed ? Do you always run your topologies with long durations, like
> > a year ? I have no idea what would be the impact on the RAM and disk, but I
> > guess RocksDB would cry a little.
> >
> >
> > Final question: if I raise the duration to 6 months, consume my records,
> > and then set the duration back to 3 days, would the old aggregates be
> > automatically destroyed ?
> >
> >
> > [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201610.mbox/%3cCABQKjkJ42N7z4BxJDKrDYZ_kmpunH738uxvm7gy24dnkx+RvVw@mail.gmail.com%3e
> >
> >
> > Thanks
> > Nicolas
> >
> >
>
>
