spark-user mailing list archives

From "Shao, Saisai" <>
Subject RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"
Date Wed, 23 Jul 2014 12:12:59 GMT
Yes, the documentation may not be precisely aligned with the latest code, so the best way is
to check the code.

-----Original Message-----
From: Haopu Wang [] 
Sent: Wednesday, July 23, 2014 5:56 PM
Subject: RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"

Jerry, thanks for the response.

For the default storage level of DStream, it looks like Spark's documentation is wrong.
It mentions:
"Default persistence level of DStreams: Unlike RDDs, the default persistence level of DStreams
serializes the data in memory (that is, StorageLevel.MEMORY_ONLY_SER for DStream compared
to StorageLevel.MEMORY_ONLY for RDDs). Even though keeping the data serialized incurs higher
serialization/deserialization overheads, it significantly reduces GC pauses."

I will take a look at DStream.scala although I have no Scala experience.

-----Original Message-----
From: Shao, Saisai [] 
Sent: Wednesday, July 23, 2014 3:13 PM
Subject: RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"

Hi Haopu, 

Please see the inline comments.


-----Original Message-----
From: Haopu Wang [] 
Sent: Wednesday, July 23, 2014 3:00 PM
Subject: "spark.streaming.unpersist" and "spark.cleaner.ttl"

I have a DStream receiving data from a socket. I'm using local mode.
I set "spark.streaming.unpersist" to "false" and leave "spark.cleaner.ttl" at infinite.
I can see files for input and shuffle blocks under the "spark.local.dir" folder, and the size
of the folder keeps increasing, although the JVM's memory usage seems to be stable.

[question] In this case, the input RDDs are persisted but don't fit into memory, so they are
written to disk, right? And where can I see the details about these RDDs? I don't see them
in the web UI.

[answer] Yes, if there is not enough memory to hold the input RDDs, the data will be flushed
to disk, because the default storage level is "MEMORY_AND_DISK_SER_2", as you can see in
StreamingContext.scala. Actually, you cannot see the input RDDs in the web UI; you can only
see the cached RDDs in the web UI.
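
A minimal sketch of the setup discussed above, assuming the Spark 1.x-era Scala API. The
host, port, application name and batch interval are placeholders and are not taken from this
thread. The storage level is passed explicitly only to make visible the value that the answer
above reports as the default for socket input.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    // Local mode needs at least two threads: one for the receiver, one for processing.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SocketStorageLevelSketch")     // hypothetical name
      .set("spark.streaming.unpersist", "false")  // setting described in the question

    // The 1-second batch interval is an assumption for illustration.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Explicit storage level; blocks that do not fit in memory can spill to disk.
    // "localhost" and 9999 are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // A trivial output operation so that each batch is actually computed.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With a level that includes disk, input blocks that exceed available memory would be expected
to show up as files under "spark.local.dir", which matches the growth described in the question.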

Then I set "spark.streaming.unpersist" to "true", and the size of the "spark.local.dir" folder
and the JVM's used heap size are reduced regularly.

[question] In this case, since I didn't change "spark.cleaner.ttl", which component is doing
the cleanup? And what's the difference if I set "spark.cleaner.ttl" to some duration in this
case?

[answer] If you set "spark.streaming.unpersist" to true, old unused RDDs will be deleted, as
you can see in DStream.scala. "spark.cleaner.ttl", by contrast, is a timer-based Spark cleaner
that cleans not only streaming data but also broadcast, shuffle and other data.
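
As a rough illustration of the two settings, here is a configuration sketch that assumes
Spark 1.x, where "spark.cleaner.ttl" is given in seconds. The application name and the
3600-second value are arbitrary placeholders, not recommendations from this thread.

```scala
import org.apache.spark.SparkConf

object CleanupConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("CleanupConfSketch")           // hypothetical name
      // Streaming-side cleanup: old, no longer needed RDDs generated by DStreams.
      .set("spark.streaming.unpersist", "true")
      // Timer-based cleaner covering streaming, broadcast, shuffle and other data.
      // The value is in seconds (Spark 1.x assumption); 3600 is arbitrary.
      .set("spark.cleaner.ttl", "3600")

    // Print the effective values to confirm what the application will run with.
    println(conf.get("spark.streaming.unpersist"))
    println(conf.get("spark.cleaner.ttl"))
  }
}
```

Leaving out the "spark.cleaner.ttl" line corresponds to the case in the question, where only
the streaming unpersist mechanism does the cleanup.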

Thank you!
