spark-user mailing list archives

From Nathan Marin <>
Subject [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed
Date Sat, 28 Mar 2015 14:48:37 GMT

I’ve been trying to use Spark Streaming for my real-time analysis
application, using the Kafka stream API on a YARN cluster of 6
executors with 4 dedicated cores and 8192 MB of dedicated memory.

The thing is, my application should run 24/7, but its disk usage keeps
growing. This eventually causes exceptions when Spark tries to write to
a file system that has no space left.

Here are some graphs showing the disk space remaining on a node where
my application is deployed:
The "drops" occur at 3-minute intervals.

Disk usage goes back to normal once I kill my application.
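
For anyone who wants to reproduce this kind of measurement on a node,
here is a minimal sketch of sampling the free space of a file system
over time. It is not the tool that produced the graphs above; the
function names, the path and the sampling interval are all
hypothetical, and it uses only the Python standard library.

```python
import shutil
import time


def sample_free_bytes(path, samples, interval_s):
    """Collect (timestamp, free_bytes) readings for the file system holding `path`."""
    readings = []
    for i in range(samples):
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        readings.append((time.time(), usage.free))
        if i < samples - 1:
            time.sleep(interval_s)  # wait between readings, not after the last one
    return readings


def free_space_lost(readings):
    """Free bytes lost between the first and last reading (negative if space was regained)."""
    return readings[0][1] - readings[-1][1]
```

Calling sample_free_bytes() on the directory the executors write to
(an assumption: whichever local directory YARN assigns on that node)
and passing the result to free_space_lost() gives a rough number for
how fast the disk fills up between two points in time.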

The persistence level of my RDD is MEMORY_AND_DISK_SER_2, but even
when I tried MEMORY_ONLY_SER_2 the same thing happened (this mode
shouldn't even allow Spark to write to disk, right?).

My question is: how can I force Spark (Streaming?) to remove whatever
it stores immediately after it has been processed? The disk clearly
isn't being cleaned up (even though the memory is), even though I call
rdd.unpersist() for each RDD processed.
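
For reference, the cleanup-related settings that exist in Spark 1.x
could be written as the configuration fragment below. This is only a
sketch: the values are placeholders, and whether either setting has
any effect on the disk growth described here is an open assumption,
not something established in this message.

```properties
# Assumed Spark 1.x properties; values are illustrative, not tuned for this job.

# Let Spark Streaming automatically unpersist the RDDs it generates
# and persists once they are no longer needed (default: true).
spark.streaming.unpersist  true

# Periodically forget metadata, and persisted RDDs, older than this
# many seconds. 3600 is an arbitrary placeholder; a value shorter than
# the time the job still needs its data can drop RDDs that are in use.
spark.cleaner.ttl          3600
```

Such properties can go in spark-defaults.conf or be passed to
spark-submit with --conf key=value.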

Here’s a sample of my application code:

Maybe something is wrong in my app too?

Thanks for your help,
