spark-user mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: Save RDDs as CSV
Date Thu, 31 Oct 2013 04:07:29 GMT
 You can do this if you coalesce the data first. However, this will
put all of your final data through a single reduce task (so you get
no parallelism and may overload a node):

myrdd.coalesce(1).saveAsTextFile("hdfs://..../my.csv")

Basically you have to choose: either you do the write in parallel and
get a lot of files, or you do the write on one node/reducer and get a
single file.
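
For the map() step Josh mentions further down the thread, here is a rough
sketch in Python of turning one record into one CSV line. The helper name
to_csv_line is made up for illustration, and it assumes each record is a
tuple or list of fields with no embedded newlines, since saveAsTextFile()
writes one record per line. The RDD calls appear only as comments and are
not run here.

```python
import csv
import io


def to_csv_line(record):
    """Format one record (a tuple or list of fields) as a single CSV line."""
    buf = io.StringIO()
    # No line terminator here: the text output writes one record per line itself.
    csv.writer(buf, lineterminator="").writerow(record)
    return buf.getvalue()


# Hypothetical usage with the RDD from this thread (assumption, not executed):
#   lines = myrdd.map(to_csv_line)
#   lines.saveAsTextFile("hdfs://..../my.csv")              # parallel write, many part files
#   lines.coalesce(1).saveAsTextFile("hdfs://..../my.csv")  # single task, single part file
```

Using the csv module rather than a plain ",".join() means fields that
contain commas or quotes get quoted, so the part files still parse as CSV.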

- Patrick

On Wed, Oct 30, 2013 at 8:05 PM, Shay Seng <shay@1618labs.com> wrote:
> Well that almost works... when I call
> myrdd.saveAsTextFile("hdfs://..../my.csv")
>
> Instead of getting a single my.csv file, as I expect, my.csv is a directory
> with a bunch of parts - all of which are csv.
> Is there some way to have those files concatenated automatically?
>
>
>
>
> On Wed, Oct 30, 2013 at 7:13 PM, Josh Rosen <rosenville@gmail.com> wrote:
>>
>> saveAsTextFile() is implemented in terms of Hadoop's TextOutputFormat,
>> which writes one record per line:
>> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L816
>>
>> You could map() each entry in your RDD into a comma-separated string, then
>> write those strings using saveAsTextFile().
>>
>>
>>
>>
>> On Wed, Oct 30, 2013 at 7:10 PM, Andre Schumacher
>> <schumach@icsi.berkeley.edu> wrote:
>>>
>>>
>>> Hi,
>>>
>>> Can you use saveAsTextFile? See
>>>
>>>
>>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
>>>
>>> I'm not sure what the default field separator is (Tab probably) but if
>>> you don't mind that may work? No need to collect it to the master.
>>>
>>> Andre
>>>
>>> On 10/30/2013 06:34 PM, Shay Seng wrote:
>>> > What's the recommended way to save an RDD as a CSV on, say, HDFS?
>>> > Do I have to collect the RDD and save it from the master, or is there
>>> > some way I can write out the CSV file in parallel to HDFS?
>>> >
>>> >
>>> > tks
>>> > shay
>>> >
>>>
>>
>
