You don’t need to “launch batches” every 5 minutes. You can launch batches every 2 seconds, and aggregate on window for 5 minutes. Spark will read data from topic every 2 seconds, and keep the data in memory for 5 minutes.

 

You need to make few decisions

  1. DO you want a tumbling window or a rolling window? A tumbling window of 5 minutes will produce an aggregate every 5 minutes. It will aggregate data for 5 minutes before. A rolling window of 5 miutes/1 minute, will produce an aggregate ever 1 minute. It will aggregate data ever 1 minute. For example, let’s say you have data evert 2 seconds. A tumbling window will produce a result on minute 5, 10, 15, 20…. Minute 5 result will have data from minute 1-4., 15 will have data from 6-10… and so on. Rolling window will produce data on minute 5, 6, 7, 8, …. Minute 5 will have aggregate from 1-5, minute 6 will have aggregate from 2-6, and so on. This defines your window. In your code you have

window(df_temp.timestamp, "2 minutes", "1 minutes")

This is a rolling window. Here second parameter(2 minutes) is the window interval, and third parameter(1 minutes) is the slide interval. In the above example, it will produce an aggregate every 1 minute interval for 2minute worth of data.

If you define

window(df_temp.timestamp, "2 minutes", "2 minutes")

This is a tumbling window. It will produce an aggregate every 2 minutes, with 2 minutes worth of data

 

 

  1. Can you have late data? How late can data arrive? Usually streaming systems send data out of order. Liik, it could happen that you get data for t=11:00:00 AM, and then get data for t=10:59:59AM. This means that the data is late by 1 second. What’s the worst case condition for late data? You need to define the watermark for late data. In your code, you have defined a watermark of 2 minutes. For aggregations, the watermark also defines which windows Spark will keep in memory. If you define a watermark of 2 minutes, and you have a rolling window with slide interval of 1 minute, Spark will keep 2 windows in memory. Watermark interval affects how much memory will be used by Spark

 

It might help if you try to follow the example in this guide very carefully http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time That is a pretty good example, but you need to follow it event by event very carefully to get all the nuances.

 

From: Giuseppe Ricci <peppepegasus@gmail.com>
Date: Monday, May 10, 2021 at 11:19 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Calculate average from Spark stream

 

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

 

Hi, I'm new on Apache Spark.

I'm trying to read data from an Apache Kafka topic (I have a simulated temperature sensor producer which sends data every 2 second) and I need every 5 minutes to calculate the average temperature. Reading documentation I understand I need to use windows but I'm not able to finalize my code. Can some help me?
How can I launch batches every 5 minutes? My code works one time and finishes. Why in the console I can't find any helpful information for correct execution? See attached picture.

This is my code:

https://pastebin.com/4S31jEeP

 

Thanks for your precious help.

 

 

 

PhD. Giuseppe Ricci