spark-dev mailing list archives

From Hemant Bhanawat <>
Subject Re: how about a custom coalesce() policy?
Date Sun, 03 Apr 2016 05:39:00 GMT
Hi Nezih,

Can you share JIRA and PR numbers?

This partial decoupling of the data partitioning strategy from Spark
parallelism would be a useful feature for any data store.


Hemant Bhanawat <>

On Fri, Apr 1, 2016 at 10:33 PM, Nezih Yigitbasi <> wrote:

> Hey Reynold,
> Created an issue (and a PR) for this change to get discussions started.
> Thanks,
> Nezih
> On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <> wrote:
>> Using the right email for Nezih
>> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <>
>> wrote:
>>> I think this can be useful.
>>> The only thing is that we are slowly migrating to the Dataset/DataFrame
>>> API and leaving RDD mostly as is, as a lower-level API. Maybe we should do
>>> both? In either case it would be great to discuss the API on a pull
>>> request. Cheers.
>>> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <> wrote:
>>>> Hi Spark devs,
>>>> I sent an email some time ago about my problem: I want to merge a
>>>> large number of small files with Spark. Currently I am using Hive
>>>> with the CombineHiveInputFormat and I can control the size of the
>>>> output files with the max split size parameter (which is used for
>>>> coalescing the input splits by the CombineHiveInputFormat). My first
>>>> attempt was to use coalesce(), but since coalesce() only considers the
>>>> target number of partitions, the output file sizes varied wildly.
>>>> What I think can be useful is to have an optional PartitionCoalescer
>>>> parameter (a new interface) in the coalesce() method (or maybe we can
>>>> add a new method?) that callers can implement for custom coalescing
>>>> strategies. For my use case I have already implemented a
>>>> SizeBasedPartitionCoalescer that coalesces partitions by looking at
>>>> their sizes and by using a max split size parameter, similar to the
>>>> CombineHiveInputFormat (I also had to expose HadoopRDD to get access
>>>> to the individual split sizes etc.).
>>>> What do you guys think about such a change? Can it be useful to other
>>>> users as well? Or do you think that there is an easier way to accomplish
>>>> the same merge logic? If you think it may be useful, I already have an
>>>> implementation and I will be happy to work with the community to contribute
>>>> it.
>>>> Thanks,
>>>> Nezih
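
The size-based strategy described in the message above can be illustrated with a small standalone sketch. This is not Nezih's SizeBasedPartitionCoalescer and it does not use any Spark API; the function name `coalesce_by_size` and its parameters are hypothetical, chosen only for illustration. It assumes the per-split sizes are already known (the message notes that HadoopRDD had to be exposed to obtain them) and greedily groups consecutive splits so that each group stays at or below a maximum split size.

```python
from typing import List, Sequence


def coalesce_by_size(split_sizes: Sequence[int], max_split_size: int) -> List[List[int]]:
    """Group consecutive split indices so that each group's total size is
    at most max_split_size. A single split larger than the limit is placed
    in a group of its own. (Hypothetical helper, not part of Spark.)"""
    if max_split_size <= 0:
        raise ValueError("max_split_size must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0

    for index, size in enumerate(split_sizes):
        # Close the current group if adding this split would exceed the limit.
        if current and current_size + size > max_split_size:
            groups.append(current)
            current = []
            current_size = 0
        current.append(index)
        current_size += size

    # Flush the last, possibly partial, group.
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    sizes = [10, 20, 30, 100, 5, 5]  # made-up split sizes, e.g. in MB
    print(coalesce_by_size(sizes, max_split_size=64))
```

Unlike a policy driven only by a target partition count, the number of output groups here follows from the data sizes, which is why the resulting outputs are bounded by the limit (apart from individually oversized splits) instead of varying with however the inputs happen to be distributed.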