spark-dev mailing list archives

From "Thakrar, Jayesh" <>
Subject Re: Feature request: split dataset based on condition
Date Mon, 04 Feb 2019 21:21:40 GMT
Just wondering if this is what you are implying Ryan (example only):

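import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

val data: DataFrame = ???  // dataset to be partitioned

// SQL CASE expression; the WHEN/THEN branches are left elided (example only)
val splitCondition =
  """CASE
    |  WHEN …. THEN ….
    |  WHEN …. THEN ….
    |END""".stripMargin
val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))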

In this case there might be a need to cache/persist the partitionedData dataset to avoid recomputation
as each "partition" is processed (e.g. saved, etc.) later on, correct?
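
A minimal PySpark sketch of that caching pattern, assuming the labelled dataset already carries the "partitionColumn" from the example above. The function names, the one-directory-per-label layout, and the choice of Parquet are illustrative assumptions, not something proposed in this thread:

```python
def split_paths(base_path, labels):
    """Map each split label to its own output directory (hypothetical layout)."""
    root = base_path.rstrip("/")
    return {label: f"{root}/{label}" for label in labels}


def save_splits(labelled, base_path, labels, column="partitionColumn"):
    """Write one output per label while computing the labelled dataset only once."""
    # Imported lazily so split_paths can be used without a Spark installation.
    from pyspark.sql import functions as F

    # persist() is lazy: the first write below materializes the cache,
    # and the remaining writes read from it instead of recomputing.
    labelled.persist()
    try:
        for label, path in split_paths(base_path, labels).items():
            labelled.filter(F.col(column) == label).write.parquet(path)
    finally:
        # Release the cached blocks once every split has been written.
        labelled.unpersist()
```

Without the persist() call, each filter-and-write pass would re-run the full lineage of the labelled dataset, which is the recomputation the question refers to.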

From: Ryan Blue <>
Reply-To: <>
Date: Monday, February 4, 2019 at 12:16 PM
To: Andrew Melo <>
Cc: Moein Hosseini <>, dev <>
Subject: Re: Feature request: split dataset based on condition

To partition by a condition, you would need to create a column with the result of that condition.
Then you would partition by that column. The sort option would also work here.
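
As a rough PySpark sketch of those two steps (derive a column from the conditions, then partition the output by it): the helper names, the "split" column, the 'other' fallback label, and the example conditions are all assumptions made for illustration.

```python
def condition_label_expr(conditions):
    """Build a SQL CASE expression naming the first condition each row matches."""
    branches = " ".join(
        f"WHEN {cond} THEN '{name}'" for name, cond in conditions.items()
    )
    # Rows matching none of the conditions fall into a catch-all label.
    return f"CASE {branches} ELSE 'other' END"


def write_partitioned(df, conditions, path):
    """Add the label column, then let the writer split the output by it."""
    # Imported lazily so the helper above can be used without a Spark installation.
    from pyspark.sql import functions as F

    labelled = df.withColumn("split", F.expr(condition_label_expr(conditions)))
    # A single write; rows are grouped into one partition directory per label.
    labelled.write.partitionBy("split").parquet(path)


# Hypothetical conditions, evaluated in the order given.
example = condition_label_expr({"high": "X > 30", "mid": "X > 10"})
```

Because CASE stops at the first matching branch, the order of the conditions matters when they overlap.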

I don't think that there is much of a use case for this. You have a set of conditions on which
to partition your data, and partitioning is already supported. The idea to use conditions
to create separate data frames would actually make that harder because you'd need to create
and name tables for each one.

On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo <<>>
Hello Ryan,

On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <<>>
> Andrew, can you give us more information about why partitioning the output data doesn't
> work for your use case?
> It sounds like all you need to do is to create a table partitioned by A and B, then you
> would automatically get the divisions you want. If what you're looking for is a way to scale
> the number of combinations then you can use formats that support more partitions, or you could
> sort by the fields and rely on Parquet row group pruning to filter out data you don't want.

TBH, I don't understand what that would look like in pyspark and what
the consequences would be. Looking at the docs, there doesn't appear to
be a syntax for partitioning on a condition (most of our conditions
are of the form 'X > 30'). The use of Spark is still somewhat new in
our field, so it's possible we're not using it correctly.
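
For conditions of the form 'X > 30', one possible way to express this in PySpark is to fold every condition into a single label column, so that each row is tagged with its combination (AB, A!B, and so on). This is only a sketch: the function names, the "Y > 5" condition, and the label format are assumptions for illustration.

```python
from itertools import product


def combination_label_expr(conditions):
    """SQL expression yielding e.g. 'A!B' for a row where A holds and B does not."""
    parts = [
        f"CASE WHEN {cond} THEN '{name}' ELSE '!{name}' END"
        for name, cond in conditions.items()
    ]
    return "concat(" + ", ".join(parts) + ")"


def all_labels(names):
    """Enumerate every label the expression can produce: 2**len(names) of them."""
    return [
        "".join(name if keep else f"!{name}" for name, keep in zip(names, flags))
        for flags in product([True, False], repeat=len(names))
    ]


# Hypothetical conditions; only 'X > 30' is taken from the message above.
expr_text = combination_label_expr({"A": "X > 30", "B": "Y > 5"})
labels = all_labels(["A", "B"])
```

The resulting string can be passed to pyspark.sql.functions.expr inside withColumn, and the new column can then serve as the partitioning column. A row where a condition evaluates to NULL falls into the '!' branch under this sketch.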


> rb
> On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <<>>
>> Hello
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <<>>
>> >
>> > I've seen many applications that need to split a dataset into multiple datasets based
>> > on some conditions. As there is no method to do it in one place, developers call the filter
>> > method multiple times. I think it would be useful to have a method that splits a dataset based
>> > on conditions in one iteration, something like the partition method in Scala (of course, Scala's
>> > partition just splits a list into two lists, but something more general could be more useful).
>> > If you think it can be helpful, I can create Jira issue and work on it to send
>> This would be a really useful feature for our use case (processing
>> collision data from the LHC). We typically want to take some sort of
>> input and split into multiple disjoint outputs based on some
>> conditions. E.g. if we have two conditions A and B, we'll end up with
>> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> combinatorics explode like 2^n, when we could produce them all up
>> front with this "multi filter" (or however it would be called).
>> Cheers
>> Andrew
>> >
>> > Best Regards
>> > Moein
>> >
>> > --
>> >
>> > Moein Hosseini
>> > Data Engineer
>> > mobile: +98 912 468 1859
>> > site:<>
>> > email:<>
>> >
> --
> Ryan Blue
> Software Engineer
> Netflix

Ryan Blue
Software Engineer