spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lalwani, Jayesh" <jlalw...@amazon.com.INVALID>
Subject Re: Calculate average from Spark stream
Date Mon, 10 May 2021 16:14:48 GMT
You don’t need to “launch batches” every 5 minutes. You can launch batches every 2 seconds,
and aggregate on window for 5 minutes. Spark will read data from topic every 2 seconds, and
keep the data in memory for 5 minutes.

You need to make few decisions

  1.  DO you want a tumbling window or a rolling window? A tumbling window of 5 minutes will
produce an aggregate every 5 minutes. It will aggregate data for 5 minutes before. A rolling
window of 5 miutes/1 minute, will produce an aggregate ever 1 minute. It will aggregate data
ever 1 minute. For example, let’s say you have data evert 2 seconds. A tumbling window will
produce a result on minute 5, 10, 15, 20…. Minute 5 result will have data from minute 1-4.,
15 will have data from 6-10… and so on. Rolling window will produce data on minute 5, 6,
7, 8, …. Minute 5 will have aggregate from 1-5, minute 6 will have aggregate from 2-6, and
so on. This defines your window. In your code you have


window(df_temp.timestamp, "2 minutes", "1 minutes")

This is a rolling window. Here second parameter(2 minutes) is the window interval, and third
parameter(1 minutes) is the slide interval. In the above example, it will produce an aggregate
every 1 minute interval for 2minute worth of data.

If you define


window(df_temp.timestamp, "2 minutes", "2 minutes")

This is a tumbling window. It will produce an aggregate every 2 minutes, with 2 minutes worth
of data





  1.  Can you have late data? How late can data arrive? Usually streaming systems send data
out of order. Liik, it could happen that you get data for t=11:00:00 AM, and then get data
for t=10:59:59AM. This means that the data is late by 1 second. What’s the worst case condition
for late data? You need to define the watermark for late data. In your code, you have defined
a watermark of 2 minutes. For aggregations, the watermark also defines which windows Spark
will keep in memory. If you define a watermark of 2 minutes, and you have a rolling window
with slide interval of 1 minute, Spark will keep 2 windows in memory. Watermark interval affects
how much memory will be used by Spark

It might help if you try to follow the example in this guide very carefully http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
That is a pretty good example, but you need to follow it event by event very carefully to
get all the nuances.

From: Giuseppe Ricci <peppepegasus@gmail.com>
Date: Monday, May 10, 2021 at 11:19 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Calculate average from Spark stream


CAUTION: This email originated from outside of the organization. Do not click links or open
attachments unless you can confirm the sender and know the content is safe.


Hi, I'm new on Apache Spark.
I'm trying to read data from an Apache Kafka topic (I have a simulated temperature sensor
producer which sends data every 2 second) and I need every 5 minutes to calculate the average
temperature. Reading documentation I understand I need to use windows but I'm not able to
finalize my code. Can some help me?
How can I launch batches every 5 minutes? My code works one time and finishes. Why in the
console I can't find any helpful information for correct execution? See attached picture.

This is my code:
https://pastebin.com/4S31jEeP

Thanks for your precious help.



PhD. Giuseppe Ricci

Mime
View raw message