spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <sro...@gmail.com>
Subject Re: Feature request: split dataset based on condition
Date Sun, 03 Feb 2019 14:41:00 GMT
I don't think Spark supports this model, where N inputs depending on parent
are computed once at the same time. You can of course cache the parent and
filter N times and do the same amount of work. One problem is, where would
the N inputs live? they'd have to be stored if not used immediately, and
presumably in any use case, only one of them would be used immediately. If
you have a job that needs to split records of a parent into N subsets, and
then all N subsets are used, you can do that -- you are just transforming
the parent to one child that has rows with those N splits of each input
row, and then consume that. See randomSplit() for maybe the best case,
where it still produce N Datasets but can do so efficiently because it's
just a random sample.

On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini <moein7tl@gmail.com> wrote:

> I don't consider it as method to apply filtering multiple time, instead
> use it as semi-action not just transformation. Let's think that we have
> something like map-partition which accept multiple lambda that each one
> collect their ROW for their dataset (or something like it). Is it possible?
>
> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen <srowen@gmail.com> wrote:
>
>> I think the problem is that can't produce multiple Datasets from one
>> source in one operation - consider that reproducing one of them would mean
>> reproducing all of them. You can write a method that would do the filtering
>> multiple times but it wouldn't be faster. What do you have in mind that's
>> different?
>>
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <moein7tl@gmail.com>
>> wrote:
>>
>>> I've seen many application need to split dataset to multiple datasets
>>> based on some conditions. As there is no method to do it in one place,
>>> developers use *filter *method multiple times. I think it can be useful
>>> to have method to split dataset based on condition in one iteration,
>>> something like *partition* method of scala (of-course scala partition
>>> just split list into two list, but something more general can be more
>>> useful).
>>> If you think it can be helpful, I can create Jira issue and work on it
>>> to send PR.
>>>
>>> Best Regards
>>> Moein
>>>
>>> --
>>>
>>> Moein Hosseini
>>> Data Engineer
>>> mobile: +98 912 468 1859 <+98+912+468+1859>
>>> site: www.moein.xyz
>>> email: moein7tl@gmail.com
>>> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
>>> [image: twitter] <https://twitter.com/moein7tl>
>>>
>>>
>
> --
>
> Moein Hosseini
> Data Engineer
> mobile: +98 912 468 1859 <+98+912+468+1859>
> site: www.moein.xyz
> email: moein7tl@gmail.com
> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
> [image: twitter] <https://twitter.com/moein7tl>
>
>

Mime
View raw message