kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From giselle.vandongen@klarrio.com <giselle.vandon...@klarrio.com>
Subject Kafka streams dropping events in join when reading from earliest message on topic
Date Fri, 14 Jun 2019 12:05:58 GMT
I have two streams of data flowing into a Kafka cluster. I want to process this data with Kafka
streams. The stream producers are started at some time t.

I start up the Kafka Streams job 5 minutes later and start reading from earliest from both
topics (about 900 000 messages already on each topic). The job parses the data and joins the
two streams. On the intermediate topics, I see that all the earlier 2x900000 events are flowing
through until the join. However, only 250 000 are outputted from the join, instead of 900
000. 

After processing the burst, the code works perfect on the new incoming events.

Grace ms of the join is put on 50 millis but putting it on 5 minutes or 10 minutes makes no
difference. My other settings:

    auto.offset.reset = earliest
    commit.interval.ms = 1000
    batch.size = 204800
    max.poll.records = 500000

The join is on the following window:

JoinWindows.of(Duration.ofMillis(1000))
      .grace(Duration.ofMillis(50))

As far as I understand these joins should be done on event time. What that can cause these
results to be discarded?


Mime
View raw message