spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed
Date Sun, 29 Mar 2015 15:50:44 GMT
Nathan:
Please look in log files for any of the following:
doCleanupRDD():
      case e: Exception => logError("Error cleaning RDD " + rddId, e)
doCleanupShuffle():
      case e: Exception => logError("Error cleaning shuffle " + shuffleId, e)
doCleanupBroadcast():
      case e: Exception => logError("Error cleaning broadcast " + broadcastId, e)
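
If it helps, here is a minimal sketch for scanning log files for those three messages. The object name, the helper name, and the idea of passing log paths as arguments are my own assumptions for illustration; only the message prefixes come from the logError calls above.

```scala
import scala.io.Source

object CleanerErrorScan {
  // Message prefixes taken from the logError calls quoted above.
  val prefixes: Seq[String] = Seq(
    "Error cleaning RDD ",
    "Error cleaning shuffle ",
    "Error cleaning broadcast "
  )

  /** Returns the lines that contain any of the cleaner error prefixes. */
  def matchingLines(lines: Iterator[String]): List[String] =
    lines.filter(line => prefixes.exists(p => line.contains(p))).toList

  def main(args: Array[String]): Unit = {
    // Each argument is a path to a log file (hypothetical; supply your own).
    args.foreach { path =>
      val source = Source.fromFile(path)
      try matchingLines(source.getLines()).foreach(l => println(s"$path: $l"))
      finally source.close()
    }
  }
}
```

Run it against the driver and executor logs; any output means the cleaner hit an exception, and the stack trace that follows the matching line in the log should say why.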

Cheers

On Sun, Mar 29, 2015 at 7:55 AM, Akhil Das <akhil@sigmoidanalytics.com>
wrote:

> Try these:
>
> - Disable shuffle spilling: spark.shuffle.spill=false (this may lead to OOM errors)
> - Enable log rotation:
>
> sparkConf.set("spark.executor.logs.rolling.strategy", "size")
>   .set("spark.executor.logs.rolling.size.maxBytes", "1024")
>   .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
>
>
> Also check what is actually filling up the disk.
>
> Thanks
> Best Regards
>
> On Sat, Mar 28, 2015 at 8:18 PM, Nathan Marin <nathan.marin@teads.tv>
> wrote:
>
>> Hi,
>>
>> I’ve been trying to use Spark Streaming for my real-time analysis
>> application using the Kafka Stream API on a cluster (using the YARN
>> version) of 6 executors with 4 dedicated cores and 8192 MB of dedicated
>> RAM.
>>
>> The thing is, my application should run 24/7, but its disk usage keeps
>> growing. This leads to exceptions when Spark tries to write to a file
>> system where no space is left.
>>
>> Here are some graphs showing the disk space remaining on a node where
>> my application is deployed:
>> http://i.imgur.com/vdPXCP0.png
>> The "drops" occurred at 3-minute intervals.
>>
>> The disk usage goes back to normal once I kill my application:
>> http://i.imgur.com/ERZs2Cj.png
>>
>> The persistence level of my RDD is MEMORY_AND_DISK_SER_2, but even
>> when I tried MEMORY_ONLY_SER_2 the same thing happened (this mode
>> shouldn't even allow Spark to write to disk, right?).
>>
>> My question is: how can I force Spark (Streaming?) to remove whatever
>> it stores immediately after it has processed it? The disk does not
>> appear to be cleaned up (even though the memory is), even though I
>> call the rdd.unpersist() method for each RDD processed.
>>
>> Here’s a sample of my application code:
>> http://pastebin.com/K86LE1J6
>>
>> Maybe something is wrong in my app too?
>>
>> Thanks for your help,
>> NM
>>
>>
>
>
