spark-user mailing list archives

From Chris Miller <>
Subject Re: Best way to merge files from streaming jobs on S3
Date Sat, 05 Mar 2016 05:03:19 GMT
Why does the order matter? Coalesce runs in parallel, so if it's just
writing to the file, I imagine the output lands in whatever order each
task happens to execute. If you want the resulting data sorted, I imagine
you'd need to collect it into some data structure instead of writing to
the file from coalesce, sort that structure, and then write your file.

Chris Miller

On Sat, Mar 5, 2016 at 5:24 AM, jelez <> wrote:

> My streaming job is creating files on S3.
> The problem is that those files end up very small if I write them to S3
> directly.
> This is why I use coalesce() to reduce the number of files and make them
> larger.
> However, coalesce shuffles data, and my job's processing time ends up higher
> than sparkBatchIntervalMilliseconds.
> I have observed that if I coalesce the number of partitions to equal the
> number of cores in the cluster I get less shuffling, but that observation is
> unsubstantiated.
> Is there any dependency/rule between the number of executors, number of
> cores, etc. that I can use to minimize shuffling while still producing the
> minimum number of output files per batch?
> What is the best practice?
