spark-user mailing list archives

From Ilya Ganelin <ilgan...@gmail.com>
Subject Re: How do you write a JavaRDD into a single file
Date Tue, 21 Oct 2014 03:05:25 GMT
Hey Steve - the way to do this is to use the coalesce() function to
coalesce your RDD into a single partition, i.e. coalesce(1). Then you can do a
saveAsTextFile and you'll wind up with outputDir/part-00000 containing all
the data.

-Ilya Ganelin
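
After rdd.coalesce(1).saveAsTextFile("outputDir"), the data sits in outputDir/part-00000 rather than in a file with a name of your choosing. Below is a minimal sketch of moving that part file to its final name. It assumes the output directory is on a local filesystem rather than HDFS; the class PartFilePromoter and the method promote are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PartFilePromoter {
    /**
     * Moves outputDir/part-00000 to target and returns the target path.
     * Assumes the job was saved with a single partition, so exactly one
     * part file holds all the data, and that outputDir is a local path.
     */
    public static Path promote(Path outputDir, Path target) throws IOException {
        Path part = outputDir.resolve("part-00000");
        if (!Files.isRegularFile(part)) {
            throw new IOException("Expected a single part file at " + part);
        }
        // Replace any earlier report with the freshly generated one.
        return Files.move(part, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Note that coalesce(1) without a shuffle may also pull the upstream stages into a single task; JavaRDD also offers coalesce(1, true), which keeps the earlier stages parallel at the cost of a shuffle.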

On Mon, Oct 20, 2014 at 11:01 PM, jay vyas <jayunit100.apache@gmail.com>
wrote:

> sounds more like a use case for using "collect"... and writing out the
> file in your program?
>
> On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis <lordjoe2000@gmail.com>
> wrote:
>
>> Sorry I missed the discussion - although it did not answer the question -
>> In my case (and I suspect the asker's) the 100 slaves are doing a lot of
>> useful work but the generated output is small enough to be handled by a
>> single process.
>> Many of the large data problems I have worked on process a lot of data but
>> end up with a single report file - frequently in a format specified by
>> preexisting downstream code.
>> I do not want a separate hadoop merge step for a lot of reasons,
>> starting with better control of the generation of the file.
>> However toLocalIterator is exactly what I need.
>> Somewhat off topic - I am being overwhelmed by getting a lot of emails
>> from the list - is there a way to get a daily summary, which might be a lot
>> easier to keep up with?
>>
>>
>> On Mon, Oct 20, 2014 at 3:23 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> This was covered a few days ago:
>>>
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
>>>
>>> Writing multiple output files is actually essential for parallelism, and
>>> certainly not a bad idea. You don't want 100 distributed workers
>>> writing to 1 file in 1 place, not if you want it to be fast.
>>>
>>> RDD and JavaRDD already expose a method to iterate over the data,
>>> called toLocalIterator. It does not require that the RDD fit entirely
>>> in memory.
>>>
>>> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis <lordjoe2000@gmail.com>
>>> wrote:
>>> >   At the end of a set of computations I have a JavaRDD<String>. I want a
>>> > single file where each string is printed in order. The data is small enough
>>> > that it is acceptable to handle the printout on a single processor. It may
>>> > be large enough that using collect to generate a list might be unacceptable.
>>> > The saveAsTextFile command creates multiple files with names like part0000,
>>> > part0001 .... This was bad behavior in Hadoop for final output and is also
>>> > bad for Spark.
>>> >   A more general issue is whether it is possible to convert a JavaRDD into
>>> > an iterator or iterable over the entire data set without using collect or
>>> > holding all data in memory.
>>> >   In many problems where it is desirable to parallelize intermediate steps
>>> > but use a single process for handling the final result this could be very
>>> > useful.
>>>
>>
>>
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com
>>
>>
>
>
> --
> jay vyas
>
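
The toLocalIterator approach mentioned in the quoted thread can be sketched as follows. The helper takes a plain java.util.Iterator<String>, which is what JavaRDD<String>.toLocalIterator() returns, so it can be exercised without a cluster. The class SingleFileWriter and the method writeLines are hypothetical names, and this is only one possible way to write the file on the driver.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

public class SingleFileWriter {
    /**
     * Streams every line from the iterator into one file, in iteration
     * order, and returns the number of lines written. Only the current
     * line is held by this method; nothing is collected into a list.
     */
    public static long writeLines(Iterator<String> lines, Path target) throws IOException {
        long count = 0;
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (lines.hasNext()) {
                out.write(lines.next());
                out.newLine();
                count++;
            }
        }
        return count;
    }
}
```

On the driver this might be called as SingleFileWriter.writeLines(rdd.toLocalIterator(), Paths.get("report.txt")), where rdd and the file name are placeholders. Spark fetches one partition at a time for toLocalIterator, so the driver needs enough memory for the largest partition rather than for the whole RDD.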
