spark-user mailing list archives

From Georg Heiler <georg.kf.hei...@gmail.com>
Subject Re: [pyspark] Use output of one aggregated function for another aggregated function within the same groupby
Date Thu, 25 Apr 2019 04:00:24 GMT
Use analytic window functions to rank the result and then filter to the
desired rank.

Rishi Shah <rishishah.star@gmail.com> wrote on Thu., 25 Apr 2019 at 05:07:

> Hi All,
>
> [PySpark 2.3, python 2.7]
>
> I would like to achieve something like this, could you please suggest best
> way to implement (perhaps highlight pros & cons of the approach in terms of
> performance)?
>
> df = df.groupby('grp_col').agg(max('date').alias('max_date'),
>     count(when(col('file_date') == col('max_date'), 1)))
>
> Please note 'max_date' is a result of aggregate function max inside the
> group by agg. I can definitely use multiple groupbys to achieve this but is
> there a better way? better performance wise may be?
>
> Appreciate your help!
>
>
> --
> Regards,
>
> Rishi Shah
>
