spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: OOM with groupBy + saveAsTextFile
Date Sun, 02 Nov 2014 16:52:39 GMT
saveAsTextFile means "save every element of the RDD as one line of text".
It works like TextOutputFormat in Hadoop MapReduce, since that's what
it uses. After groupBy, each element is a (key, Iterable) pair, so you are
causing it to create one big string out of each Iterable this way.
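
To make the difference concrete, here is a minimal plain-Java sketch with no Spark involved. The class and method names (GroupedLines, linePerGroup, linePerValue) are made up for illustration, and a Map of Lists stands in for the grouped RDD. It only mimics the "one element becomes one line via toString()" behavior described above; it is not Spark's actual code path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a Map<String, List<String>> stands in for
// the result of groupBy, i.e. a collection of (key, Iterable) records.
final class GroupedLines {

    // One line per grouped record: the whole value list is rendered into a
    // single string, so the string grows with the size of the group.
    static List<String> linePerGroup(Map<String, List<String>> grouped) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            lines.add("(" + e.getKey() + "," + e.getValue() + ")");
        }
        return lines;
    }

    // One line per value: the group is flattened first, so every string
    // stays as small as a single (key, value) pair.
    static List<String> linePerValue(Map<String, List<String>> grouped) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            for (String v : e.getValue()) {
                lines.add("(" + e.getKey() + "," + v + ")");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("a", Arrays.asList("1", "2", "3"));
        grouped.put("b", Arrays.asList("4"));

        System.out.println(linePerGroup(grouped)); // 2 lines, one per key
        System.out.println(linePerValue(grouped)); // 4 lines, one per value
    }
}
```

On a real grouped RDD, flattening the values again (for example with flatMapValues) before calling saveAsTextFile would probably play the role of linePerValue here, though I have not tried that against your job.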

On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar <reachbach@gmail.com> wrote:
> Thanks for responding. This is what I initially suspected, and hence asked
> why the library needed to construct the entire value buffer on a single host
> before writing it out. The stacktrace appeared to suggest that user code is
> not constructing the large buffer. I'm simply calling groupBy and saveAsText
> on the resulting grouped rdd. The value after grouping is an
> Iterable<Tuple4<String, Double, String, String>>. None of the strings are
> large. I also do not need a single large string created out of the Iterable
> for writing to disk. Instead, I expect the iterable to get written out in
> chunks in response to saveAsText. Perhaps this shouldn't be the default
> behavior of saveAsText? Hence my original question about the behavior of
> saveAsText. The tuning / partitioning attempts were aimed at reducing memory
> pressure so that multiple such buffers aren't constructed at the same time
> on a host. I'll take a second look at the data and code before updating this
> thread. Thanks.
>


