spark-user mailing list archives

From Georg Heiler <georg.kf.hei...@gmail.com>
Subject Re: [Pyspark 2.4] Best way to define activity within different time window
Date Tue, 11 Jun 2019 08:07:23 GMT
For grouping by each of those columns, look into grouping sets:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html

On Tue, Jun 11, 2019 at 06:09, Rishi Shah <rishishah.star@gmail.com> wrote:

> Thank you both for your input!
>
> To calculate a moving average of active users, could you comment on whether
> to go for an RDD-based implementation or a DataFrame? If DataFrame, will a
> window function work here?
>
> In general, how would Spark behave when working with a DataFrame that has
> date, week, month, quarter and year columns and grouping by each of them,
> one at a time?
>
>
>
> On Sun, Jun 9, 2019 at 1:17 PM Jörn Franke <jornfranke@gmail.com> wrote:
>
>> Depending on the accuracy needed, HyperLogLog can be an interesting
>> alternative:
>> https://en.m.wikipedia.org/wiki/HyperLogLog
>>
>> On 09.06.2019 at 15:59, big data <bigdatabase@outlook.com> wrote:
>>
>> In my opinion, a bitmap is the best solution for calculating active users.
>> Most other solutions are based on a count(distinct) calculation, which is
>> slower.
>>
>> If you have already implemented a bitmap solution, including how to build
>> and load the bitmaps, then a bitmap is the best choice.
>> On 2019/6/5 at 6:49 PM, Rishi Shah wrote:
>>
>> Hi All,
>>
>> Is there a best practice around calculating daily, weekly, monthly,
>> quarterly, yearly active users?
>>
>> One approach is to build a bitmap of active users per day and aggregate
>> the daily bitmaps by period later. However, I was wondering if anyone has
>> a better approach to tackling this problem.
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>>
>
> --
> Regards,
>
> Rishi Shah
>
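
The daily-bitmap approach discussed in the thread can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each user is assumed to have already been mapped to a small integer index, a Python int stands in for the bitmap (a production system would more likely use a compressed bitmap library), and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def build_daily_bitmaps(events):
    """Build one bitmap per day from (day, user_index) pairs.

    Bit i of a day's bitmap is set if the user with index i was active.
    """
    bitmaps = defaultdict(int)
    for day, user_index in events:
        bitmaps[day] |= 1 << user_index
    return dict(bitmaps)


def active_users(daily_bitmaps, period_key):
    """OR the daily bitmaps that fall into the same period, then count bits."""
    merged = defaultdict(int)
    for day, bitmap in daily_bitmaps.items():
        merged[period_key(day)] |= bitmap
    return {period: bin(bitmap).count("1") for period, bitmap in merged.items()}


def iso_week(day):
    """Period key for weekly counts: (ISO year, ISO week number)."""
    year, week, _ = day.isocalendar()
    return (year, week)


def month(day):
    """Period key for monthly counts: (year, month)."""
    return (day.year, day.month)


# Hypothetical sample events: (day, user_index).
EVENTS = [
    (date(2019, 6, 3), 0),
    (date(2019, 6, 3), 1),
    (date(2019, 6, 4), 0),
    (date(2019, 6, 10), 2),
]

if __name__ == "__main__":
    daily = build_daily_bitmaps(EVENTS)
    print(active_users(daily, lambda d: d))  # daily active users
    print(active_users(daily, iso_week))     # weekly active users
    print(active_users(daily, month))        # monthly active users
```

Because OR-ing bitmaps is idempotent, a user who is active on several days of a period is still counted once, which is why the daily bitmaps can be built once and rolled up to any longer period afterwards.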
