You don’t need to “launch batches” every 5 minutes. You can trigger micro-batches every 2 seconds and aggregate over a 5-minute window: Spark will read from the topic every 2 seconds and keep the windowed state in memory for 5 minutes.
You need to make a few decisions. If you define
window(df_temp.timestamp, "2 minutes", "1 minutes")
this is a sliding (rolling) window. The second parameter ("2 minutes") is the window duration, and the third parameter ("1 minute") is the slide interval. In the example above, it produces an aggregate every 1 minute over the last 2 minutes’ worth of data.
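To see the bucketing concretely, here is a small plain-Python sketch (not Spark API) of the window assignment that Spark’s window() performs: for a given event timestamp, it computes every (start, end) window the event falls into. The function name and epoch-aligned assumption are mine for illustration.

```python
def windows_for(ts, window_sec, slide_sec):
    """Return (start, end) pairs of all windows containing timestamp ts,
    mimicking Spark's window() bucketing (assumed aligned to the epoch)."""
    windows = []
    # Latest window start that is <= ts, aligned on the slide interval.
    start = ts - (ts % slide_sec)
    # Walk backwards: every earlier start within window_sec of ts also covers it.
    while start > ts - window_sec:
        windows.append((start, start + window_sec))
        start -= slide_sec
    return sorted(windows)

# A 2-minute window sliding every 1 minute: an event at t=150s (2:30)
# lands in two windows, [1:00, 3:00) and [2:00, 4:00).
print(windows_for(150, 120, 60))   # [(60, 180), (120, 240)]
```

Note that with duration 2 minutes and slide 1 minute, every event contributes to exactly two overlapping windows, which is why you get an aggregate every minute.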
If you define
window(df_temp.timestamp, "2 minutes", "2 minutes")
this is a tumbling window: it produces an aggregate every 2 minutes, each covering 2 minutes’ worth of data.
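For the original use case (a reading every 2 seconds, averaged over 5-minute tumbling windows), the grouping in Spark would be groupBy(window(df_temp.timestamp, "5 minutes")).avg("temperature"); the plain-Python sketch below simulates that same bucketing-and-averaging so you can check the expected result. The function and the simulated readings are mine for illustration, not Spark API.

```python
from statistics import mean

def tumbling_avg(readings, window_sec=300):
    """Group (timestamp, temperature) readings into tumbling windows of
    window_sec seconds and average each bucket -- the same result that
    window(timestamp, "5 minutes") followed by avg("temperature") gives."""
    buckets = {}
    for ts, temp in readings:
        start = ts - (ts % window_sec)   # tumbling: exactly one bucket per event
        buckets.setdefault(start, []).append(temp)
    return {start: mean(temps) for start, temps in sorted(buckets.items())}

# A simulated sensor emitting every 2 seconds for 10 minutes:
# temperatures cycle 20, 22, 24, 26, 28, so each 5-minute window averages 24.
readings = [(t, 20.0 + (t % 10)) for t in range(0, 600, 2)]
print(tumbling_avg(readings))        # {0: 24.0, 300: 24.0}
```

Because the slide equals the duration, each event is counted in exactly one window, so the two 5-minute buckets partition the 10 minutes of data.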
It might help to work through the example in this guide very carefully: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time It is a good example, but you need to follow it event by event to catch all the nuances.
Hi, I'm new to Apache Spark.
I'm trying to read data from an Apache Kafka topic (I have a simulated temperature sensor producer which sends data every 2 seconds), and every 5 minutes I need to calculate the average temperature. Reading the documentation, I understand I need
to use windows, but I'm not able to finalize my code. Can someone help me?
How can I launch batches every 5 minutes? My code runs once and finishes. Why can't I find any helpful information in the console about correct execution? See the attached picture.
This is my code:
Thanks for your precious help.
PhD. Giuseppe Ricci