spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: How to make Spark merge the output file?
Date Tue, 07 Jan 2014 17:37:39 GMT
HDFS, since 0.21 <https://issues.apache.org/jira/browse/HDFS-222>, has a
concat() method which would do exactly this, but I am not sure of the
performance implications. Of course, as Matei pointed out, it's unusual to
actually need a single HDFS file.
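
For anyone who really does need a single file, a rough sketch of driving that concat() call through the Hadoop FileSystem API is below. This is an untested illustration, not something from the thread: the paths and object name are hypothetical, concat() is only implemented by HDFS (other FileSystem implementations throw UnsupportedOperationException), and HDFS places restrictions on the source files (for example, matching block sizes).

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: merge the part-* files that Spark wrote into a
// single HDFS file using the concat() call added by HDFS-222.
object ConcatParts {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val outputDir = new Path("hdfs:///user/example/output") // hypothetical Spark output dir

    // Collect the part files in order; saveAsTextFile names them part-00000, part-00001, ...
    val parts = fs.listStatus(outputDir)
      .map(_.getPath)
      .filter(_.getName.startsWith("part-"))
      .sortBy(_.getName)

    if (parts.nonEmpty) {
      // concat() appends the source files' blocks onto an existing target
      // file and removes the sources, so use the first part file as the target.
      if (parts.length > 1) {
        fs.concat(parts.head, parts.tail)
      }
      fs.rename(parts.head, new Path(outputDir, "merged")) // hypothetical final name
    }
  }
}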


On Mon, Jan 6, 2014 at 9:08 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:

> Unfortunately, this is expensive to do on HDFS — you’d need a single writer
> to write the whole file. If your file is small enough for that, you can use
> coalesce() on the RDD to bring all the data to one node, and then save it.
> However, most HDFS applications work with directories containing multiple
> files instead of single files for this reason.
>
> Matei
>
> On Jan 6, 2014, at 10:56 PM, Nan Zhu <zhunanmcgill@gmail.com> wrote:
>
> > Hi, all
> >
> > maybe a stupid question, but is there any way to make Spark write a
> single file instead of partitioned files?
> >
> > Best,
> >
> > --
> > Nan Zhu
> >
>
>
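
To make Matei's coalesce() suggestion concrete, here is a minimal sketch with the RDD API. It is only an illustration: the SparkContext setup, paths, and transformation are made up, not taken from the thread.

import org.apache.spark.SparkContext

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; the master URL and app name will vary.
    val sc = new SparkContext("local", "SingleFileOutput")

    val data = sc.textFile("hdfs:///user/example/input") // hypothetical input path
      .map(_.toUpperCase)                                // stand-in transformation

    // coalesce(1) funnels every partition through one task, so the entire
    // dataset is written by a single node: fine for small results, a
    // bottleneck for large ones, as Matei notes above.
    data.coalesce(1).saveAsTextFile("hdfs:///user/example/output-single") // hypothetical output path

    sc.stop()
  }
}

Note that saveAsTextFile still produces a directory; coalesce(1) just guarantees it contains a single part file.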
