spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: How to write a RDD into One Local Existing File?
Date Mon, 20 Oct 2014 08:17:39 GMT
If you don't want part-xxx files in the output but a single file, you should
repartition (or coalesce) the RDD into 1 partition. (This will be a bottleneck,
since you are disabling parallelism; it's like giving everything to one machine
to process.) You are better off writing the part-xxx files to HDFS and merging
them after the Spark job finishes (use hadoop fs -getmerge).
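
If the job wrote its part-xxxxx files to a local directory rather than HDFS, the same merge step can be sketched in plain Python. This is only a minimal sketch, not part of Spark or Hadoop: `append_parts` and its parameters are hypothetical names, and it assumes that concatenating the part files in file-name order is acceptable. The target is opened in append mode, so an existing file is extended rather than overwritten.

```python
import shutil
from pathlib import Path


def append_parts(output_dir: str, target_file: str) -> int:
    """Append every part-* file in output_dir to target_file, in name order.

    Hypothetical helper for illustration; returns the number of files appended.
    """
    # "part-*" skips marker files such as _SUCCESS and hidden .crc checksums.
    parts = sorted(p for p in Path(output_dir).glob("part-*") if p.is_file())
    # Append-binary mode preserves whatever target_file already contains.
    with open(target_file, "ab") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copies in chunks, so no part file is loaded fully into memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

Because the copy is streamed, the driver never needs to hold the data in memory, and the merge runs only after the Spark job has finished writing its parts.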

Thanks
Best Regards

On Mon, Oct 20, 2014 at 10:01 AM, Rishi Yadav <rishi@infoobjects.com> wrote:

> Write to HDFS and then get one file locally by using "hdfs dfs
> -getmerge..."
>
>
> On Friday, October 17, 2014, Sean Owen <sowen@cloudera.com> wrote:
>
>> You can save to a local file. What are you trying and what doesn't work?
>>
>> You can output one file by repartitioning to 1 partition but this is
>> probably not a good idea as you are bottlenecking the output and some
>> upstream computation by disabling parallelism.
>>
>> How about just combining the files on HDFS afterwards? Or just reading
>> all the files instead of one? You can hdfs dfs -cat a bunch of files at
>> once.
>>
>> On Fri, Oct 17, 2014 at 6:46 PM, Parthus <peng.wei.prc@gmail.com> wrote:
>> > Hi,
>> >
>> > I have a Spark MapReduce task which requires me to write the final RDD
>> > to an existing local file (appending to this file). I tried two ways,
>> > but neither works well:
>> >
>> > 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can
>> > write to local, but I have never made it work. Moreover, the result is
>> > not one file but a series of part-xxxxx files, which is not what I hope
>> > to get.
>> >
>> > 2. Collect the RDD to an array and write it to the driver node using
>> > Java's File IO. There are also two problems: 1) my RDD is huge (1 TB),
>> > which cannot fit into the memory of one driver node, so I have to split
>> > the task into small pieces, collect them part by part, and write;
>> > 2) during the writing by Java IO, the Spark MapReduce task has to wait,
>> > which is not efficient.
>> >
>> > Could anybody provide me an efficient way to solve this problem? I wish
>> > that the solution could be like: appending a huge RDD to a local file
>> > without pausing the MapReduce during writing?
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> > For additional commands, e-mail: user-help@spark.apache.org
>> >
>>
>
> --
> - Rishi
>
