spark-user mailing list archives

From Aureliano Buendia <buendia...@gmail.com>
Subject Re: How to save as a single file efficiently?
Date Sat, 22 Mar 2014 00:01:01 GMT
Good to know it's as simple as that! I wonder why shuffle=true is not the
default for coalesce().
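
For reference, a minimal sketch of what that looks like end to end (the paths and the map are placeholders, not our actual job):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("single-csv-output"))

// Read with enough partitions to keep the map step parallel.
val rdd = sc.textFile("hdfs:///tmp/input", 1000)

rdd
  .map(line => line)               // placeholder for the real transformation
  .coalesce(1, shuffle = true)     // map stays parallel; only the final write goes through one reducer
  .saveAsTextFile("hdfs:///tmp/output")

Note that saveAsTextFile still writes a directory; with a single partition it just contains one part-00000 file.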


On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:

> Try passing the shuffle=true parameter to coalesce; it will then do the
> map in parallel but still pass all the data through one reduce node to
> write it out. That's probably the fastest it will get. No need to cache
> if you do that.
>
> Matei
>
> On Mar 21, 2014, at 4:04 PM, Aureliano Buendia <buendia360@gmail.com>
> wrote:
>
> > Hi,
> >
> > Our spark app reduces a few 100 GB of data to a few 100 KB of csv. We
> found that a partition count of 1000 is a good number to speed the process
> up. However, it does not make sense to have 1000 csv files, each smaller
> than 1 KB.
> >
> > We used RDD.coalesce(1) to get only 1 csv file, but it's extremely slow,
> and it does not use our resources properly. So this is very slow:
> >
> > rdd.map(...).coalesce(1).saveAsTextFile()
> >
> > How is it possible to use coalesce(1) simply for concatenating the
> materialized output text files? Would something like this make sense?:
> >
> > rdd.map(...).coalesce(100).coalesce(1).saveAsTextFile()
> >
> > Or, would something like this achieve it?:
> >
> > rdd.map(...).cache().coalesce(1).saveAsTextFile()
>
>
