spark-user mailing list archives

From Chris Miller <cmiller11...@gmail.com>
Subject Re: Best way to merge files from streaming jobs on S3
Date Sat, 05 Mar 2016 05:03:19 GMT
Why does the order matter? Coalesce runs in parallel, and if it's just
writing to the file, I imagine it writes in whatever order each task
happens to execute. If you want the resulting data sorted, you'd need to
save it to some sort of data structure instead of writing to the file
from coalesce, sort that structure, and then write your file.
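[One way to sketch that idea in Spark itself: sort the RDD before narrowing the partitions, since a plain coalesce preserves the existing partition order. The variable names, sample data, and S3 bucket below are hypothetical, not from this thread.]

```scala
import org.apache.spark.rdd.RDD

// Hypothetical keyed RDD; names and data are illustrative only.
val records: RDD[(String, String)] =
  sc.parallelize(Seq(("2016-03-05", "b"), ("2016-03-04", "a")))

records
  .sortByKey()   // global sort by key; this step does shuffle
  .coalesce(1)   // narrow to one partition; sorted order is preserved
                 // because coalesce without shuffle reads partitions in order
  .saveAsTextFile("s3a://my-bucket/sorted-output/")  // bucket is hypothetical
```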


--
Chris Miller

On Sat, Mar 5, 2016 at 5:24 AM, jelez <jelez@hotmail.com> wrote:

> My streaming job is creating files on S3.
> The problem is that those files end up very small if I just write them to
> S3
> directly.
> This is why I use coalesce() to reduce the number of files and make them
> larger.
>
> However, coalesce shuffles data and my job processing time ends up higher
> than sparkBatchIntervalMilliseconds.
>
> I have observed that if I coalesce the number of partitions to be equal to
> the cores in the cluster I get less shuffling - but that is
> unsubstantiated.
> Is there any dependency/rule between number of executors, number of cores
> etc. that I can use to minimize shuffling and at the same time achieve
> minimum number of output files per batch?
> What is the best practice?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-files-from-streaming-jobs-on-S3-tp26400.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
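[For reference, the per-batch pattern described in the question can be sketched roughly as below. Note that `coalesce(n)` with its default `shuffle = false` only narrows partitions without a full shuffle; `repartition(n)` (i.e. `shuffle = true`) is what forces one. The stream source, core count, and S3 path here are assumptions, not from the thread.]

```scala
// Hypothetical sketch: coalesce each micro-batch down to roughly one
// partition per core before writing, to get fewer, larger files on S3.
val targetPartitions = sc.defaultParallelism  // often ~= total executor cores

stream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    rdd.coalesce(targetPartitions)  // narrow partitions; no full shuffle
       .saveAsTextFile(s"s3a://my-bucket/output/batch-${time.milliseconds}/")
  }
}
```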
