spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@apache.org>
Subject Re: SPARK-942
Date Sun, 03 Nov 2013 07:50:49 GMT
It's not a very elegant solution, but one possibility is for the
CacheManager to check whether it will have enough space. If it is running
out of space, skips buffering the output of the iterator & directly write
the output of the iterator to disk (if storage level allows that).

But it is still tricky to know whether we will run out of space before we
even start running the iterator. One possibility is to use sizing data from
previous partitions to estimate the size of the current partition (i.e.
estimated in memory size = avg of current in-memory size / current input
size).

Do you have any ideas on this one, Kyle?


On Sat, Oct 26, 2013 at 10:53 AM, Kyle Ellrott <kellrott@soe.ucsc.edu>wrote:

> I was wondering if anybody had any thoughts on the best way to tackle
> SPARK-942 ( https://spark-project.atlassian.net/browse/SPARK-942 ).
> Basically, Spark takes an iterator from a flatmap call and because I tell
> it that it needs to persist Spark proceeds to push it all into an array
> before deciding that it doesn't have enough memory and trying to serialize
> it to disk, and somewhere along the line it runs out of memory. For my
> particular operation, the function return an iterator that reads data out
> of a file, and the size of the files passed to that function can vary
> greatly (from a few kilobytes to a few gigabytes). The funny thing is that
> if I do a strait 'map' operation after the flat map, everything works,
> because Spark just passes the iterator forward and never tries to expand
> the whole thing into memory. But I need do a reduceByKey across all the
> records, so I'd like to persist to disk first, and that is where I hit this
> snag.
> I've already setup a unit test to replicate the problem, and I know the
> area of the code that would need to be fixed.
> I'm just hoping for some tips on the best way to fix the problem.
>
> Kyle
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message