spark-user mailing list archives

From Bharath Ravi Kumar
Subject Re: OOM with groupBy + saveAsTextFile
Date Mon, 03 Nov 2014 10:13:14 GMT
I also realized from your description of saveAsTextFile that the API is
indeed behaving as expected, i.e. it is appropriate (though not optimal) for
the API to construct a single string out of each value. If the value turns
out to be large, the user of the API needs to reconsider the implementation
approach. My bad.
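
To make the distinction concrete, here is a minimal plain-Java sketch with no Spark dependency. It only imitates the "one element, one line" rule described in this thread; the class and method names (GroupedOutputSketch, oneLinePerGroup, oneLinePerRecord) are hypothetical, and a Map of Lists stands in for the grouped RDD. The first method mirrors what happens when a grouped value is written directly: the whole collection is rendered into one string. The second emits one short line per record, so no single string grows with the size of a group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a Map<String, List<String>> stands in for a grouped RDD.
class GroupedOutputSketch {

    // One line per group. The entire value collection is converted to a
    // single string, so the line length grows with the size of the group.
    static List<String> oneLinePerGroup(Map<String, List<String>> grouped) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            lines.add("(" + e.getKey() + "," + e.getValue() + ")");
        }
        return lines;
    }

    // One line per record. Each (key, value) pair becomes its own line,
    // so the longest string is bounded by the size of a single record.
    static List<String> oneLinePerRecord(Map<String, List<String>> grouped) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            for (String value : e.getValue()) {
                lines.add(e.getKey() + "\t" + value);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("a", Arrays.asList("1", "2", "3"));
        grouped.put("b", Arrays.asList("4"));

        System.out.println("One line per group:");
        oneLinePerGroup(grouped).forEach(System.out::println);

        System.out.println("One line per record:");
        oneLinePerRecord(grouped).forEach(System.out::println);
    }
}
```

In Spark terms, the second shape would correspond to flattening the grouped values back into one element per record before saving, or to not grouping at all if only the output layout matters. I have not tested that against this job, so treat it as a direction to explore rather than a verified fix.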

On Mon, Nov 3, 2014 at 3:38 PM, Bharath Ravi Kumar wrote:

> The result was no different with saveAsHadoopFile. In both cases, I can
> see that I've misinterpreted the API docs. I'll explore the APIs a bit
> further for ways to save the iterable as chunks rather than as one large
> text/binary value. It might also help to clarify this aspect in the API
> docs. For those (like me) whose first practical experience with data
> processing is through Spark, having skipped the Hadoop MR ecosystem, it
> might help to clarify interactions with HDFS and the like. Thanks for all
> the help.
>
> On Sun, Nov 2, 2014 at 10:22 PM, Sean Owen wrote:
>> saveAsTextFile means "save every element of the RDD as one line of text".
>> It works like TextOutputFormat in Hadoop MapReduce since that's what
>> it uses. So you are causing it to create one big string out of each
>> Iterable this way.
>>
>> On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar wrote:
>> > Thanks for responding. This is what I initially suspected, and hence I
>> > asked why the library needed to construct the entire value buffer on a
>> > single host before writing it out. The stack trace appeared to suggest
>> > that user code is not constructing the large buffer. I'm simply calling
>> > groupBy and saveAsTextFile on the resulting grouped RDD. The value after
>> > grouping is an Iterable<Tuple4<String, Double, String, String>>. None of
>> > the strings are large. I also do not need a single large string created
>> > out of the Iterable for writing to disk. Instead, I expect the iterable
>> > to get written out in chunks in response to saveAsTextFile. Perhaps this
>> > shouldn't be the default behavior of saveAsTextFile? Hence my original
>> > question about the behavior of saveAsTextFile. The tuning / partitioning
>> > attempts were aimed at reducing memory pressure so that multiple such
>> > buffers aren't constructed at the same time on a host. I'll take a
>> > second look at the data and code before updating this thread. Thanks.
