spark-dev mailing list archives

From Enrico Minack <m...@Enrico.Minack.dev>
Subject Re: Need to order iterator values in spark dataframe
Date Thu, 26 Mar 2020 17:24:31 GMT
Abhinav,

you can repartition by your key, then sortWithinPartitions, and then 
groupByKey. Since the data are already hash-partitioned by key, Spark 
should not shuffle them and hence should not change the sort order 
within each partition:

ds.repartition($"key").sortWithinPartitions($"code").groupBy($"key")
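Applied to the typed API from your example, the whole pipeline might look 
roughly like this. This is only a sketch, not tested against a cluster: it 
assumes the key_class case class from your mail (with proper Scala String 
types), and it relies on Spark reusing the hash partitioning from 
repartition so that groupByKey does not reshuffle:

```scala
// Sketch only (untested): repartition + sortWithinPartitions combined
// with the typed groupByKey/mapGroups from the question below.
import org.apache.spark.sql.SparkSession

case class key_class(key: String, code: String, code_value: String)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val result = df
  .as[key_class]
  .repartition($"key")            // hash-partition rows by key
  .sortWithinPartitions($"code")  // sort each partition by code, no extra shuffle
  .groupByKey(_.key)              // should reuse the existing partitioning
  .mapGroups { (key, groupedValues) =>
    // groupedValues should now iterate in ascending code order
    groupedValues.map(row => "SUCCESS").toList
  }
  .toDF("status")
```

Whether the iterator order survives groupByKey depends on Spark not 
re-shuffling the already-partitioned data, so it is worth verifying on 
your Spark version before relying on it.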

Enrico


On 26.03.20 at 17:53, Ranjan, Abhinav wrote:
>
> Hi,
>
> I have a dataframe which has data like:
>
> key    |    code    |    code_value
> 1      |    c1      |    11
> 1      |    c2      |    12
> 1      |    c2      |    9
> 1      |    c3      |    12
> 1      |    c2      |    13
> 1      |    c2      |    14
> 1      |    c4      |    12
> 1      |    c2      |    15
> 1      |    c1      |    12
>
>
> I need to group the data based on key and then apply some custom logic 
> to every value I got by grouping. So I did this:
>
> let's suppose it is in a dataframe df.
>
> case class key_class(key: String, code: String, code_value: String)
>
>
> df
> .as[key_class]
> .groupByKey(_.key)
> .mapGroups { (key, groupedValues) =>
>   val status = groupedValues.map { row =>
>     // do some custom logic on row
>     "SUCCESS"
>   }.toList
>   status
> }.toDF("status")
>
>
> The issue with the above approach is that the values I get after 
> applying groupByKey are not sorted/ordered. I want the values to be 
> sorted by the column 'code'.
>
> There are a few ways to do this:
>
> 1. get them in a list and then sort ==> this will result in OOM if the 
> iterator is too big.
>
> 2. somehow apply a secondary sort, but the problem with that approach 
> is that I have to keep track of the key change.
>
> 3. sortWithinPartitions cannot be applied because groupBy will mess up 
> the order.
>
> 4. Another approach is:
>
> df
> .as[key_class]
> .sort("key", "code")
> .map {
>  // do stuff here
> }
>
> but here also I have to keep track of the key change within the map 
> function, and sometimes this also overflows if the keys are skewed.
>
>
> So is there any way in which I can get the values sorted after 
> grouping them by a key?
>
> Thanks,
>
> Abhinav
>

