spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rishi Shah <rishishah.s...@gmail.com>
Subject Re: [pyspark 2.4.3] nested windows function performance
Date Mon, 21 Oct 2019 11:58:59 GMT
Hi All,

Any suggestions?

Thanks,
-Rishi

On Sun, Oct 20, 2019 at 12:56 AM Rishi Shah <rishishah.star@gmail.com>
wrote:

> Hi All,
>
> I have a use case where I need to perform nested windowing functions on a
> data frame to get final set of columns. Example:
>
> w1 = Window.partitionBy('col1')
> df = df.withColumn('sum1', F.sum('val'))
>
> w2 = Window.partitionBy('col1', 'col2')
> df = df.withColumn('sum2', F.sum('val'))
>
> w3 = Window.partitionBy('col1', 'col2', 'col3')
> df = df.withColumn('sum3', F.sum('val'))
>
> These 3 partitions are not huge at all, however the data size is 2T
> parquet snappy compressed. This throws a lot of outofmemory errors.
>
> I would like to get some advice around whether nested window functions is
> a good idea in pyspark? I wanted to avoid using multiple filter + joins to
> get to the final state, as join can create crazy shuffle.
>
> Any suggestions would be appreciated!
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah

Mime
View raw message