spark-dev mailing list archives

From Nezih Yigitbasi <nyigitb...@netflix.com.INVALID>
Subject Re: how about a custom coalesce() policy?
Date Sun, 03 Apr 2016 06:27:13 GMT
Sure, here <https://issues.apache.org/jira/browse/SPARK-14042> is the JIRA
issue and this <https://github.com/apache/spark/pull/11865> is the PR.

Nezih

On Sat, Apr 2, 2016 at 10:40 PM Hemant Bhanawat <hemant9379@gmail.com>
wrote:

> correcting email id for Nezih
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Sun, Apr 3, 2016 at 11:09 AM, Hemant Bhanawat <hemant9379@gmail.com>
> wrote:
>
>> Hi Nezih,
>>
>> Can you share JIRA and PR numbers?
>>
>> This partial de-coupling of data partitioning strategy and spark
>> parallelism would be a useful feature for any data store.
>>
>> Hemant
>>
>> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>> www.snappydata.io
>>
>> On Fri, Apr 1, 2016 at 10:33 PM, Nezih Yigitbasi <
>> nyigitbasi@netflix.com.invalid> wrote:
>>
>>> Hey Reynold,
>>> Created an issue (and a PR) for this change to get discussions started.
>>>
>>> Thanks,
>>> Nezih
>>>
>>> On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>
>>>> Using the right email for Nezih
>>>>
>>>>
>>>> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <rxin@databricks.com>
>>>> wrote:
>>>>
>>>>> I think this can be useful.
>>>>>
>>>>> The only thing is that we are slowly migrating to the
>>>>> Dataset/DataFrame API, and leaving RDD mostly as is, as a lower-level
>>>>> API. Maybe we should do both? In either case it would be great to
>>>>> discuss the API on a pull request. Cheers.
>>>>>
>>>>> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
>>>>> nyigitbasi@netflix.com.invalid> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I sent an email about my problem some time ago: I want to merge a
>>>>>> large number of small files with Spark. Currently I am using Hive
>>>>>> with the CombineHiveInputFormat, and I can control the size of the
>>>>>> output files with the max split size parameter (which is used for
>>>>>> coalescing the input splits by the CombineHiveInputFormat). My first
>>>>>> attempt was to use coalesce(), but since coalesce() only considers
>>>>>> the target number of partitions, the output file sizes varied wildly.
>>>>>>
>>>>>> What I think can be useful is to have an optional PartitionCoalescer
>>>>>> parameter (a new interface) in the coalesce() method (or maybe we
>>>>>> can add a new method?) that callers can implement for custom
>>>>>> coalescing strategies. For my use case I have already implemented a
>>>>>> SizeBasedPartitionCoalescer that coalesces partitions by looking at
>>>>>> their sizes and by using a max split size parameter, similar to the
>>>>>> CombineHiveInputFormat (I also had to expose HadoopRDD to get access
>>>>>> to the individual split sizes etc.).
>>>>>>
>>>>>> What do you guys think about such a change? Can it be useful to other
>>>>>> users as well? Or do you think that there is an easier way to
>>>>>> accomplish the same merge logic? If you think it may be useful, I
>>>>>> already have an implementation and I will be happy to work with the
>>>>>> community to contribute it.
>>>>>>
>>>>>> Thanks,
>>>>>> Nezih
>>>>>>
>>>>>
>>>>>
>>>>
>>
>
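
The size-based coalescing idea from the original proposal (group input
splits so that each group stays under a max split size) can be sketched
roughly as below. This is a minimal illustration in plain Python, not the
SizeBasedPartitionCoalescer from the PR and not Spark's API: the function
name group_by_size, its parameters, and the greedy in-order packing
strategy are all assumptions made for the example.

```python
from typing import List, Sequence


def group_by_size(split_sizes: Sequence[int], max_split_size: int) -> List[List[int]]:
    """Greedily pack consecutive splits into groups of at most max_split_size bytes.

    Hypothetical helper for illustration only. Returns a list of groups,
    each a list of indices into split_sizes. A single split larger than
    max_split_size ends up in a group of its own.
    """
    if max_split_size <= 0:
        raise ValueError("max_split_size must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0

    for index, size in enumerate(split_sizes):
        # Close the current group if adding this split would exceed the cap.
        if current and current_size + size > max_split_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size

    # Flush the last, possibly partial, group.
    if current:
        groups.append(current)
    return groups
```

With split sizes [40, 30, 50, 10, 120, 5] and a max split size of 100,
this groups the splits as [[0, 1], [2, 3], [4], [5]]. Each group would
correspond to one output partition, so output sizes are bounded by the cap
rather than determined only by a target partition count, which is the
limitation of plain coalesce() described in the proposal.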
