kafka-users mailing list archives

From Benjamin Black <benblac...@gmail.com>
Subject Re: Aggregated windowed counts
Date Wed, 04 Jan 2017 22:09:42 GMT
I'm hoping the DSL will do what I want :) Currently the example is
continuously adding instead of bucketing, so if I modify it by adding a
window to the count function:

.groupBy((key, word) -> word)
.count(TimeWindows.of(5000L), "Counts")
.toStream((k, v) -> k.key());

Then I do see bucketing happening. However, the counts aren't accurate. For
example, I type "kafka" into the console as 20 separate sentences, but the
output I get is:

kafka 4
kafka 9
kafka 2
kafka 7

These add up to 22, not 20. What am I doing wrong? What is the relationship
between the commit interval and the time window? The smaller I make the
commit interval, the less accurate the counts become.
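
One possible reading of these numbers, offered as an assumption rather than
a confirmed diagnosis: a windowed count behaves like a continuously updated
table, so the current count of a window can be forwarded downstream more
than once, and each forwarded value is a running total rather than an
increment. Under that reading, 2 and 7 could be two updates for the same
window, and 4 + 9 + 7 = 20. The sketch below is a plain-Java model of this
behavior, not Kafka Streams code; the class name, the method names, the
timestamps, and the fixed flush interval standing in for the commit interval
are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowUpdateSketch {
    // Hypothetical arrival times (ms) of 20 records: 4, then 9, then 2 + 5.
    static final long[] SAMPLE = {
        1000, 2000, 3000, 4000,
        6100, 6200, 6300, 6400, 6500, 6600, 6700, 6800, 6900,
        10500, 11500,
        12100, 12200, 12300, 12400, 12500
    };

    // Start of the tumbling window that contains the given timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Emits the running count of every window changed since the last flush.
    static void flush(Map<Long, Long> dirty, List<String> out) {
        for (Map.Entry<Long, Long> e : dirty.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        dirty.clear();
    }

    // Counts records per window; timestamps must be in ascending order.
    // flushEveryMs is an assumed stand-in for the commit interval.
    static List<String> emitted(long[] timestamps, long sizeMs, long flushEveryMs) {
        Map<Long, Long> counts = new TreeMap<>();
        Map<Long, Long> dirty = new TreeMap<>();
        List<String> out = new ArrayList<>();
        long nextFlush = flushEveryMs;
        for (long ts : timestamps) {
            while (ts >= nextFlush) {
                flush(dirty, out);
                nextFlush += flushEveryMs;
            }
            long w = windowStart(ts, sizeMs);
            long c = counts.merge(w, 1L, Long::sum);
            dirty.put(w, c);
        }
        flush(dirty, out);
        return out;
    }

    public static void main(String[] args) {
        // 5-second windows, flush every 6 seconds.
        System.out.println(emitted(SAMPLE, 5000L, 6000L));
    }
}
```

With these made-up timestamps, main prints [0=4, 5000=9, 10000=2, 10000=7].
Keeping only the last value per window start gives 4 + 9 + 7 = 20. If this
reading applies, including the window start in the output key (for example
k.window().start() in the toStream mapper) should make repeated updates for
the same window visible.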


On Wed, Jan 4, 2017 at 3:53 PM Matthias J. Sax <matthias@confluent.io>
wrote:

> Do you know about Kafka Streams? Its DSL gives you exactly what you
> want to do.
>
> Check out the documentation and WordCount example:
>
> http://docs.confluent.io/current/streams/index.html
>
> https://github.com/confluentinc/examples/blob/3.1.x/kafka-streams/src/main/java/io/confluent/examples/streams/WordCountLambdaExample.java
>
>
> Let us know if you have further questions.
>
>
> -Matthias
>
> On 1/4/17 12:48 PM, Benjamin Black wrote:
> > Hello,
> >
> > I'm looking for guidance on how to approach a counting problem. We want to
> > consume a stream of data that consists of IDs and generate an output of the
> > aggregated count with a window size of X seconds using processing time and
> > a hopping time window. For example, using a window size of 1 second, if we
> > get IDs 1, 2, 2, 2 in the 1st second, then the output would be 1=1, 2=3. If
> > we get IDs 1, 3, 3 in the 2nd second then the output would be 1=1, 3=2. The
> > aggregated count will then be turned into increment commands to a cache and
> > a database.
> >
> > Obviously we will need some state to be stored during the count of a
> > window, but we only need to keep it for the time period of the window (i.e.
> > a second). I was thinking this could be achieved by using a persistent
> > store, where the counts are reset during the punctuate and the store topic
> > uses log compaction. Alternatively, we could simply have an in-memory
> > store that is reset during the punctuate. My concern with the in-memory
> > store is that I don't know when the input topic offset is committed or when
> > the output data is written and therefore we could lose data. Ultimately, at
> > the end of the second, the input offset and output data should be written
> > at the same time, reducing the likelihood of lost data. We would rather
> > lose data than have duplicate counts. What is the correct approach? Is
> > there a better way of tackling the problem?
> >
> > I have put together some code, but it doesn't do exactly what I expect. I'm
> > happy to share if it helps.
> >
> > Thanks,
> > Ben
> >
>
>
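
The counting example in the original question (IDs 1, 2, 2, 2 in the first
second, then 1, 3, 3 in the next) can be sketched in plain Java as follows.
This is a minimal illustration, not Kafka Streams code. It assumes
non-overlapping one-second windows (a hopping window whose advance equals
its size), and the class name, method name, and timestamps are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class TumblingIdCounts {
    // events[i] = {timestampMs, id}; returns windowStartMs -> (id -> count).
    static Map<Long, Map<Long, Long>> countPerWindow(long[][] events, long sizeMs) {
        Map<Long, Map<Long, Long>> result = new TreeMap<>();
        for (long[] e : events) {
            // Align the timestamp down to the start of its window.
            long start = e[0] - (e[0] % sizeMs);
            result.computeIfAbsent(start, k -> new TreeMap<>())
                  .merge(e[1], 1L, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up arrival times (ms) paired with the IDs from the example.
        long[][] events = {
            {100, 1}, {300, 2}, {500, 2}, {900, 2},
            {1100, 1}, {1500, 3}, {1900, 3}
        };
        System.out.println(countPerWindow(events, 1000L));
    }
}
```

For these inputs, main prints {0={1=1, 2=3}, 1000={1=1, 3=2}}, matching the
per-second outputs described in the question.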
