spark-user mailing list archives

From Sean Owen <>
Subject Re: OOM with groupBy + saveAsTextFile
Date Mon, 03 Nov 2014 10:12:56 GMT
Yes, that's really the same thing. You're still writing a huge value
as part of one single (key,value) record. The whole value has to exist
in memory in order to be written to storage. Although there are no hard
limits, keys and values in general aren't intended to be huge, on the
order of hundreds of megabytes.

You should probably design this differently, so that it doesn't try to
collect a massive value per key. That is good practice in general, not
just for this reason.

Certainly, you don't have to be able to fit many (key,value) records in
memory at once. But you do have to be able to fit one.
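
To make the difference concrete, here is a minimal sketch in plain Python,
with no Spark involved. It only mimics the two shapes of output: one text
line per grouped record versus one text line per element. The function
names and the sample data are made up for illustration. In Spark, the
second shape would roughly correspond to flattening the grouped values
(for example with flatMap) before saving, or to not grouping at all.

```python
from itertools import groupby
from operator import itemgetter


def group_by_key(records):
    """Mimic a groupBy: yield (key, list_of_values) pairs."""
    ordered = sorted(records, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def lines_one_per_group(grouped):
    """One text line per (key, values) record.

    The whole group is formatted into a single string, so the size of
    the largest line grows with the size of the largest group.
    """
    for key, values in grouped:
        yield f"{key}\t{values}"


def lines_one_per_element(grouped):
    """Flatten first: one short text line per (key, value) pair."""
    for key, values in grouped:
        for value in values:
            yield f"{key}\t{value}"


# Hypothetical skewed data: key "a" has 1000 values, key "b" has one.
records = [("a", i) for i in range(1000)] + [("b", 0)]
grouped = list(group_by_key(records))

grouped_lines = list(lines_one_per_group(grouped))
flat_lines = list(lines_one_per_element(grouped))

longest_grouped = max(len(line) for line in grouped_lines)
longest_flat = max(len(line) for line in flat_lines)

print("lines per group:", len(grouped_lines), "longest:", longest_grouped)
print("lines per element:", len(flat_lines), "longest:", longest_flat)
```

With one line per element, the largest record that has to be held in
memory stays small no matter how skewed the keys are. The cost is that
the key is repeated on every line.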

On Mon, Nov 3, 2014 at 10:08 AM, Bharath Ravi Kumar <> wrote:
> The result was no different with saveAsHadoopFile. In both cases, I can see
> that I've misinterpreted the API docs. I'll explore the APIs a bit further
> for ways to save the iterable as chunks rather than one large text/binary.
> It might also help to clarify this aspect in the API docs. For those (like
> me) whose first practical experience with data processing is through Spark,
> having skipped the Hadoop MR ecosystem, it might help to clarify
> interactions with HDFS and the like. Thanks for all the help.
> On Sun, Nov 2, 2014 at 10:22 PM, Sean Owen <> wrote:
>> saveAsText means "save every element of the RDD as one line of text".
>> It works like TextOutputFormat in Hadoop MapReduce since that's what
>> it uses. So you are causing it to create one big string out of each
>> Iterable this way.
>> On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar <> wrote:
>> > Thanks for responding. This is what I initially suspected, and hence
>> > asked why the library needed to construct the entire value buffer on
>> > a single host before writing it out. The stacktrace appeared to
>> > suggest that user code is not constructing the large buffer. I'm
>> > simply calling groupBy and saveAsText on the resulting grouped rdd.
>> > The value after grouping is an
>> > Iterable<Tuple4<String, Double, String, String>>. None of the strings
>> > are large. I also do not need a single large string created out of
>> > the Iterable for writing to disk. Instead, I expect the iterable to
>> > get written out in chunks in response to saveAsText. This shouldn't
>> > be the default behaviour of saveAsText perhaps? Hence my original
>> > question of the behavior of saveAsText. The tuning / partitioning
>> > attempts were aimed at reducing memory pressure so that multiple such
>> > buffers aren't constructed at the same time on a host. I'll take a
>> > second look at the data and code before updating this thread. Thanks.