kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthias J. Sax" <matth...@confluent.io>
Subject Re: joining two windowed aggregations
Date Wed, 03 May 2017 23:35:06 GMT
> Seems like this would be a standard join operation 

Not sure, if I would consider this a "standard" join... Your windows
have different size and thus "move" differently fast.

Kafka Stream joins provide sliding join semantics. Similar to a SQL
query like this (conceptually):

> SELECT * FROM stream1, stream2
>     stream1.key = stream2.key
>     AND
>     stream1.ts - windowSize <= stream2.ts AND stream2.ts <= stream1.ts + windowSize

Ie, it joins all records that are timely close to each other -- what is
a very natural way to express a stream join (what I would consider a
"standard" join). (Btw: this applies _one_ window over both streams)

The hopping window semantics you describe, are also possible with a
little extra work though:

You would first use an aggregation that "collects" your windows (ie,
your aggregation function build up a list of input records as
aggregation result). You can apply this to both your streams with
according TimeWindow.of().advanceBy() settings.

For the join itself, you would merge both streams via
`KStreamBuilder#merge` and apply a stateful (Value)Transformer
downstream. In your Transformer#process you can write custom logic to
compare your windows.

Does this help?


On 5/3/17 10:51 AM, Jon Yeargers wrote:
> I want to collect data in two windowed groups - 4 hours with a one hour
> overlap and a 5 minute / 1 minute. I want to compare the values in the
> _oldest_ window for each group.
> Seems like this would be a standard join operation but Im not clear on how
> to limit which window the join operates on. I could keep a timestamp in
> each aggregate and if it isn't what I want (IE < 4 hours old) then ignore
> the join but this seems v inefficient.
> Likely Im missing the big-picture here again w/re KStreams. I keep running
> into situations where it seems like Kafka Streams would be a great tool but
> it just doesn't quite fit. Kind of like having a drawer with mixed
> metric/std wrenches.

View raw message