spark-user mailing list archives

From Steve Lewis <lordjoe2...@gmail.com>
Subject Re: How do you write a JavaRDD into a single file
Date Mon, 20 Oct 2014 22:53:54 GMT
Sorry I missed the discussion, although it did not answer the question.
In my case (and I suspect the asker's) the 100 slaves are doing a lot of
useful work but the generated output is small enough to be handled by a
single process.
Many of the large data problems I have worked on process a lot of data but
end up with a single report file, frequently in a format specified by
preexisting downstream code.
I do not want a separate Hadoop merge step for a lot of reasons, starting
with better control of the generation of the file.
However toLocalIterator is exactly what I need.
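
A minimal sketch of that approach, not a definitive implementation: the
driver streams an iterator of strings into one file, one line per string.
The class name SingleFileWriter and the method writeLines are hypothetical
names chosen for illustration. In a real job the iterator would come from
rdd.toLocalIterator() on the JavaRDD<String>; a small in-memory list stands
in for it here so the example runs without a cluster.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper (not part of Spark): writes an iterator of strings
// to a single file, in iteration order, without building a list first.
public class SingleFileWriter {

    /** Streams each string to 'out' on its own line and returns the line count. */
    public static long writeLines(Iterator<String> lines, Path out) throws IOException {
        long count = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            while (lines.hasNext()) {
                writer.write(lines.next());
                writer.newLine();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: with Spark this would be
        //   Iterator<String> lines = rdd.toLocalIterator();
        // where rdd is the JavaRDD<String> produced by the job.
        Iterator<String> lines = Arrays.asList("alpha", "beta", "gamma").iterator();

        Path out = Files.createTempFile("report", ".txt");
        long written = writeLines(lines, out);
        System.out.println(written + " lines written to " + out);
    }
}
```

Because the file is written by a single process on the driver, the output
format stays fully under the control of the writing code, and no separate
merge step is needed afterwards.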
Somewhat off topic - I am being overwhelmed by the number of emails from
the list. Is there a way to get a daily summary, which might be a lot
easier to keep up with?


On Mon, Oct 20, 2014 at 3:23 PM, Sean Owen <sowen@cloudera.com> wrote:

> This was covered a few days ago:
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
>
> Multiple output files are actually essential for parallelism, and
> certainly not a bad idea. You don't want 100 distributed workers
> writing to 1 file in 1 place, not if you want it to be fast.
>
> RDD and JavaRDD already expose a method to iterate over the data,
> called toLocalIterator. It does not require that the RDD fit entirely
> in memory.
>
> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2000@gmail.com>
> wrote:
> >   At the end of a set of computations I have a JavaRDD<String>. I want a
> > single file where each string is printed in order. The data is small
> > enough that it is acceptable to handle the printout on a single processor.
> > It may be large enough that using collect to generate a list might be
> > unacceptable. The saveAsTextFile command creates multiple files with names
> > like part0000, part0001 .... This was bad behavior in Hadoop for final
> > output and is also bad for Spark.
> >   A more general issue is whether it is possible to convert a JavaRDD into
> > an iterator or iterable over the entire data set without using collect or
> > holding all data in memory.
> >   In many problems where it is desirable to parallelize intermediate steps
> > but use a single process for handling the final result this could be very
> > useful.
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
