spark-user mailing list archives

From Marco Platania
Subject Spark Streaming - Is window() caching DStreams?
Date Fri, 27 May 2016 20:16:35 GMT
Dear all,
Can someone please explain to me how Spark Streaming executes the window() operation? From the
Spark 1.6.1 documentation, it seems that windowed batches are automatically cached in memory,
but looking at the web UI it seems that operations already executed in previous batches are
executed again. For your convenience, I attach a screenshot of my running application below.
By looking at the web UI, it seems that flatMapValues() RDDs are cached (green spot - this
is the last operation executed before I call window() on the DStream), but, at the same time,
it also seems that all the transformations that led to flatMapValues() in previous batches
are executed again. If this is the case, the window() operation may induce huge performance
penalties, especially if the window duration is 1 or 2 hours (as I expect for my application).
Do you think that checkpointing the DStream at that point would help? Note that the
expected slide interval is about 5 minutes.
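
To make the concern concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It counts how many batches each window covers and how many of them were already part of the previous window. The 10-second batch interval and the helper name window_batch_counts are assumptions for illustration only; the window and slide durations are the ones mentioned above.

```python
from datetime import timedelta


def window_batch_counts(batch_interval, window_duration, slide_duration):
    """Return (batches per window, new batches per slide, batches shared
    with the previous window) for a sliding window over fixed-size batches.

    Hypothetical helper for illustration; it is not part of any Spark API.
    """
    # Spark Streaming requires the window and slide durations to be
    # multiples of the batch interval, so the same check is applied here.
    if window_duration % batch_interval or slide_duration % batch_interval:
        raise ValueError("window and slide must be multiples of the batch interval")

    per_window = window_duration // batch_interval
    per_slide = slide_duration // batch_interval
    # Batches that also belonged to the previous window.
    shared = max(per_window - per_slide, 0)
    return per_window, per_slide, shared


# Assumed batch interval of 10 seconds (not stated in this message).
batch = timedelta(seconds=10)
slide = timedelta(minutes=5)

for window in (timedelta(hours=1), timedelta(hours=2)):
    per_window, per_slide, shared = window_batch_counts(batch, window, slide)
    print(f"window={window}: {per_window} batches per window, "
          f"{per_slide} new per slide, {shared} shared with the previous window")
```

Under these assumptions a 2-hour window spans 720 batches, of which only 30 are new at each 5-minute slide. If the upstream transformations really were re-executed for the other 690 batches every time, the cost would be dominated by redundant work, which is why it matters whether the cached flatMapValues() RDDs are actually reused.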
Hope someone can clarify this point.