spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <>
Subject Re: Feature request: split dataset based on condition
Date Sun, 03 Feb 2019 14:41:00 GMT
I don't think Spark supports this model, where N inputs depending on parent
are computed once at the same time. You can of course cache the parent and
filter N times and do the same amount of work. One problem is, where would
the N inputs live? they'd have to be stored if not used immediately, and
presumably in any use case, only one of them would be used immediately. If
you have a job that needs to split records of a parent into N subsets, and
then all N subsets are used, you can do that -- you are just transforming
the parent to one child that has rows with those N splits of each input
row, and then consume that. See randomSplit() for maybe the best case,
where it still produce N Datasets but can do so efficiently because it's
just a random sample.

On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini <> wrote:

> I don't consider it as method to apply filtering multiple time, instead
> use it as semi-action not just transformation. Let's think that we have
> something like map-partition which accept multiple lambda that each one
> collect their ROW for their dataset (or something like it). Is it possible?
> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen <> wrote:
>> I think the problem is that can't produce multiple Datasets from one
>> source in one operation - consider that reproducing one of them would mean
>> reproducing all of them. You can write a method that would do the filtering
>> multiple times but it wouldn't be faster. What do you have in mind that's
>> different?
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <>
>> wrote:
>>> I've seen many application need to split dataset to multiple datasets
>>> based on some conditions. As there is no method to do it in one place,
>>> developers use *filter *method multiple times. I think it can be useful
>>> to have method to split dataset based on condition in one iteration,
>>> something like *partition* method of scala (of-course scala partition
>>> just split list into two list, but something more general can be more
>>> useful).
>>> If you think it can be helpful, I can create Jira issue and work on it
>>> to send PR.
>>> Best Regards
>>> Moein
>>> --
>>> Moein Hosseini
>>> Data Engineer
>>> mobile: +98 912 468 1859 <+98+912+468+1859>
>>> site:
>>> email:
>>> [image: linkedin] <>
>>> [image: twitter] <>
> --
> Moein Hosseini
> Data Engineer
> mobile: +98 912 468 1859 <+98+912+468+1859>
> site:
> email:
> [image: linkedin] <>
> [image: twitter] <>

View raw message