Hey Reynold,

Created an issue (and a PR) for this change to get discussions started.

Thanks,
Nezih

On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <firstname.lastname@example.org> wrote:

Using the right email for Nezih

On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <email@example.com> wrote:

I think this can be useful. The only thing is that we are slowly migrating to the Dataset/DataFrame API and leaving RDD mostly as is, as a lower-level API. Maybe we should do both? In either case it would be great to discuss the API on a pull request. Cheers.

On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <firstname.lastname@example.org> wrote:
Hi Spark devs,
Some time ago I sent an email about my problem: I want to merge a large number of small files with Spark. Currently I am using Hive with CombineHiveInputFormat, and I can control the size of the output files with the max split size parameter (which CombineHiveInputFormat uses to coalesce the input splits). My first attempt was to use coalesce(), but since coalesce() only considers the target number of partitions, the output file sizes varied wildly.
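To illustrate the problem, here is a toy sketch in plain Scala (the partition sizes are made up for illustration): merging partitions purely by position, the way a fixed-count coalesce does, leaves the merged sizes as skewed as the input.

```scala
// Hypothetical partition sizes in MB for a skewed input (illustrative only).
val sizes = Seq(512L, 1L, 1L, 512L, 1L, 1L, 512L, 1L)

// coalesce(4)-style merging: adjacent partitions are grouped by position,
// their sizes are never consulted.
val byCount = sizes.grouped(2).map(_.sum).toSeq
println(byCount) // three ~512 MB outputs and one 2 MB output
```

A size-aware strategy would instead pack partitions into groups until a size budget is reached, which is exactly what the max split size parameter achieves in CombineHiveInputFormat.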
What I think can be useful is an optional PartitionCoalescer parameter (a new interface) in the coalesce() method (or maybe we can add a new method?) that callers can implement for custom coalescing strategies. For my use case I have already implemented a SizeBasedPartitionCoalescer that coalesces partitions by looking at their sizes and using a max split size parameter, similar to CombineHiveInputFormat (I also had to expose HadoopRDD to get access to the individual split sizes, etc.).
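One possible shape for the proposal, sketched in plain Scala without Spark's RDD machinery: the trait and class names come from the email, but the method signature and the greedy grouping below are assumptions for illustration, not the actual implementation.

```scala
import scala.collection.mutable.ArrayBuffer

// Assumed interface: given the parent partitions' sizes in bytes, return
// groups of partition indices; each group becomes one output partition.
trait PartitionCoalescer {
  def coalesce(partitionSizes: Seq[Long]): Seq[Seq[Int]]
}

// Greedy size-based strategy: keep adding partitions to the current group
// until the next one would push it past maxSplitSize, then start a new group.
class SizeBasedPartitionCoalescer(maxSplitSize: Long) extends PartitionCoalescer {
  def coalesce(partitionSizes: Seq[Long]): Seq[Seq[Int]] = {
    val groups = ArrayBuffer[Seq[Int]]()
    var current = ArrayBuffer[Int]()
    var currentSize = 0L
    for ((size, idx) <- partitionSizes.zipWithIndex) {
      if (current.nonEmpty && currentSize + size > maxSplitSize) {
        groups += current.toSeq
        current = ArrayBuffer[Int]()
        currentSize = 0L
      }
      current += idx
      currentSize += size
    }
    if (current.nonEmpty) groups += current.toSeq
    groups.toSeq
  }
}

// With a 100-byte budget, small partitions get packed together while large
// ones stay alone, so output partitions end up close to maxSplitSize.
val groups = new SizeBasedPartitionCoalescer(100L)
  .coalesce(Seq(60L, 60L, 30L, 10L, 90L))
println(groups)
```

In the real RDD a strategy like this would also want to respect locality, which is a second reason the interface belongs in coalesce() rather than in caller code.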
What do you think about such a change? Can it be useful to other users as well? Or do you think there is an easier way to accomplish the same merge logic? If you think it may be useful, I already have an implementation and will be happy to work with the community to contribute it.