spark-user mailing list archives

From Dominik Safaric <dominiksafa...@gmail.com>
Subject Streaming 2.1.0 - window vs. batch duration
Date Thu, 16 Mar 2017 19:34:49 GMT
Hi all,

I’ve implemented a streaming application that pulls data from Kafka every second (the batch interval), and I am observing some quite strange behaviour (I haven’t used Spark extensively in the past, but rather continuous operator based engines).

Namely, when the windowed stream produced by dstream.window(Seconds(60)) is written back to Kafka, it contains more messages than were consumed (for debugging purposes I’m using a small dataset of a million byte-array-deserialized Kafka messages). In particular, I’ve streamed exactly 1 million messages in total, whereas upon window expiry 60 million messages are written back to Kafka.
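
For reference, here is roughly what the relevant part of my code looks like (a simplified sketch, not my actual code; the bootstrap server, topic name, group id and the writeToKafka helper are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val conf = new SparkConf().setAppName("window-debug")
val ssc = new StreamingContext(conf, Seconds(1)) // 1 second batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // placeholder
  "key.deserializer" -> classOf[ByteArrayDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer],
  "group.id" -> "window-debug"                        // placeholder
)

val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
  ssc,
  PreferConsistent,
  Subscribe[Array[Byte], Array[Byte]](Seq("input-topic"), kafkaParams)
)

// window duration 60s; the slide duration is omitted, so it defaults to
// the batch interval (1 second)
val windowed = stream.map(_.value).window(Seconds(60))

// placeholder for my producer logic writing each partition back to Kafka
def writeToKafka(records: Iterator[Array[Byte]]): Unit = { /* producer.send(...) */ }

windowed.foreachRDD { rdd =>
  rdd.foreachPartition(writeToKafka)
}

ssc.start()
ssc.awaitTermination()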


I’ve read in the official docs that both the window duration and the slide duration must be multiples of the batch interval. Does this mean that, since the slide duration defaults to the batch interval, the RDD of a given batch interval t is included in 60 consecutive windows, i.e. the same batch is being ingested 59 more times into the windowed stream (which would explain the factor: 1 million × 60 = 60 million)?

If I would like to keep this configuration (a batch interval of one second and a window duration of 60 seconds) but have each message appear in the windowed output only once - how might one achieve this?
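
What I suspect (please correct me if I’m wrong) is that I need to pass the slide duration explicitly so that the window tumbles instead of sliding every batch, along these lines:

// window duration 60s AND slide duration 60s: non-overlapping (tumbling)
// windows, so each 1-second batch should land in exactly one window
val windowed = stream.map(_.value).window(Seconds(60), Seconds(60))

Is that the right way to get each message written out exactly once per window?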

I would appreciate it if anyone could correct me where I’ve got the internals of Spark’s windowed operations wrong, and elaborate a bit.

Thanks,
Dominik