spark-dev mailing list archives

From Kyle Ellrott <kellr...@soe.ucsc.edu>
Subject SPARK-942
Date Sat, 26 Oct 2013 17:53:18 GMT
I was wondering if anybody had any thoughts on the best way to tackle
SPARK-942 ( https://spark-project.atlassian.net/browse/SPARK-942 ).
Basically, Spark takes the iterator returned by a flatMap call and,
because I've told it to persist the result, pushes the whole thing into
an in-memory array before deciding it doesn't have enough memory and
serializing it to disk, and somewhere along the line it runs out of
memory. In my particular case, the function returns an iterator that
reads data out of a file, and the size of the files passed to that
function can vary greatly (from a few kilobytes to a few gigabytes).
The funny thing is that if I do a straight 'map' operation after the
flatMap, everything works, because Spark just passes the iterator
forward and never tries to expand the whole thing into memory. But I
need to do a reduceByKey across all the records, so I'd like to persist
to disk first, and that is where I hit this snag.
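
To make the pattern concrete, here is roughly what the job looks like.
This is a simplified sketch, not my actual code: the paths, the
tab-split parsing, and the (key, count) record format are made-up
stand-ins.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel
import scala.io.Source

object Spark942Repro {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "spark-942-repro")

    // Made-up input paths; the real files range from a few KB to a few GB.
    val paths = sc.parallelize(Seq("/data/small.txt", "/data/huge.txt"))

    // The flatMap returns a lazy iterator over each file's records,
    // so nothing is materialized in memory at this point.
    val records = paths.flatMap { path =>
      Source.fromFile(path).getLines().map { line =>
        val fields = line.split("\t")
        (fields(0), 1L) // made-up (key, count) record format
      }
    }

    // A plain map after the flatMap is fine: the iterator is passed
    // straight through.
    //records.map(identity).count()

    // Persisting is where it breaks: Spark expands the iterator into
    // an array before writing anything out, and a gigabyte-scale file
    // blows the heap.
    records.persist(StorageLevel.DISK_ONLY)
           .reduceByKey(_ + _)
           .collect()
  }
}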
I've already set up a unit test that replicates the problem, and I know
the area of the code that would need to be fixed.
I'm just hoping for some tips on the best way to fix the problem.

Kyle
