And to answer your original question, spark.cleaner.ttl is not safe for the exact reason you brought up. The PR Mark linked is intended to provide a much cleaner (and safer) solution.

On Tue, Mar 11, 2014 at 2:01 PM, Mark Hamstra <firstname.lastname@example.org> wrote:

Actually, TD's work-in-progress is probably more what you want: https://github.com/apache/spark/pull/126

On Tue, Mar 11, 2014 at 1:58 PM, Michael Allman <email@example.com> wrote:

Hello,
I've been trying to run an iterative Spark job that spills 1+ GB to disk per iteration on a system with limited disk space. I believe there would be enough space if Spark cleaned up unused data from previous iterations, but as it stands the number of iterations I can run is limited by the available disk space.
I found a thread on the usage of spark.cleaner.ttl on the old Spark Users Google group here:
I think this setting may be what I'm looking for; however, the cleaner seems to delete data that's still in use. The effect is that I get bizarre exceptions from Spark complaining about missing broadcast data, or ArrayIndexOutOfBoundsException. When is spark.cleaner.ttl safe to use? Is it supposed to delete in-use data, or is this a bug/shortcoming?
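For context, here is how the setting in question would typically be applied; the TTL value of 3600 seconds is a hypothetical example, not a recommendation from this thread (and, per the replies above, the setting can delete data that is still in use):

```
# spark-defaults.conf -- illustrative sketch only
# spark.cleaner.ttl: periodically purges metadata and persisted data
# older than this many seconds, regardless of whether it is still referenced.
spark.cleaner.ttl    3600
```

The same property can be passed programmatically via SparkConf.set("spark.cleaner.ttl", "3600") before creating the SparkContext. Because the purge is purely time-based, any RDD or broadcast variable still needed after the TTL expires can trigger the missing-data errors described above.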
Senior Software Engineer