spark-user mailing list archives

From Sean Owen
Subject Re: How do you write a JavaRDD into a single file
Date Mon, 20 Oct 2014 22:23:17 GMT
This was covered a few days ago:

Multiple output files are actually essential for parallelism, and
certainly not a bad idea. You don't want 100 distributed workers
writing to 1 file in 1 place, not if you want it to be fast.

RDD and JavaRDD already expose a method to iterate over the data,
called toLocalIterator. It does not require that the RDD fit entirely
in memory.
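
As a rough sketch of how that could look for the single-file case: the
helper below (SingleFileWriter and writeLines are hypothetical names,
not part of Spark) drains any java.util.Iterator<String> into one file,
line by line. It assumes the output path is on the driver's local
filesystem and that UTF-8 is the encoding you want. On a
JavaRDD<String>, toLocalIterator() returns such an iterator and pulls
the data to the driver one partition at a time, so the driver needs
room for the largest partition rather than the whole RDD.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

// Hypothetical helper: writes an iterator of strings to a single file.
public class SingleFileWriter {

    // Writes each element on its own line, in iteration order,
    // and returns the number of lines written.
    public static long writeLines(Iterator<String> lines, Path out) throws IOException {
        long count = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            while (lines.hasNext()) {
                writer.write(lines.next());
                writer.newLine();
                count++;
            }
        }
        return count;
    }

    // Assumed usage on the driver, where rdd is a JavaRDD<String>:
    //   SingleFileWriter.writeLines(rdd.toLocalIterator(), Paths.get("output.txt"));
}
```

Since toLocalIterator fetches partitions one after another, the write is
sequential by design. If the RDD is expensive to compute, caching it
first may avoid recomputing its lineage for each partition.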

On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis wrote:
>   At the end of a set of computations I have a JavaRDD<String>. I want a
> single file where each string is printed in order. The data is small enough
> that it is acceptable to handle the printout on a single processor. It may
> be large enough that using collect to generate a list might be unacceptable.
> The saveAsTextFile method creates multiple files with names like part0000,
> part0001 .... This was bad behavior in Hadoop for final output and is also
> bad for Spark.
>   A more general issue is whether it is possible to convert a JavaRDD into
> an iterator or iterable over the entire data set without using collect or
> holding all data in memory.
>    In many problems where it is desirable to parallelize intermediate steps
> but use a single process for handling the final result, this could be very
> useful.
