spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: [Pyspark 2.4] Best way to define activity within different time window
Date Sun, 09 Jun 2019 17:17:12 GMT
Depending on the accuracy you need, HyperLogLog sketches can be an interesting alternative:
https://en.m.wikipedia.org/wiki/HyperLogLog
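
To make the idea concrete, here is a minimal pure-Python sketch of a HyperLogLog counter. It is an illustration only, not Spark's implementation: the class name `HyperLogLog`, the precision `p=12`, the SHA-1 hash, and the synthetic user IDs are all assumptions made for this example. The property that matters for daily/weekly/monthly counts is that two sketches can be merged by taking the register-wise maximum, so daily sketches can be combined into longer periods without rescanning the raw events.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not production code)."""

    def __init__(self, p=12):
        self.p = p                    # number of index bits (assumed value)
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two "daily" sketches with overlapping synthetic users.
day1 = HyperLogLog()
for u in range(6000):
    day1.add(f"user-{u}")

day2 = HyperLogLog()
for u in range(4000, 10000):
    day2.add(f"user-{u}")

two_day = day1.merge(day2)
estimate = two_day.count()   # approximates the 10000 distinct users
```

In Spark itself, `approx_count_distinct` from `pyspark.sql.functions` gives an approximate distinct count based on HyperLogLog++, so a hand-written sketch like this is mainly useful for understanding the accuracy trade-off.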

> On 09.06.2019 at 15:59, big data <bigdatabase@outlook.com> wrote:
> 
> In my opinion, a bitmap is the best solution for calculating active users. Most other
> solutions rely on a count(distinct) calculation, which is slower.
> 
> If you have already implemented a bitmap solution, including how to build and load the
> bitmaps, then bitmaps are the best choice.
> 
>> On 2019/6/5 at 6:49 PM, Rishi Shah wrote:
>> Hi All,
>> 
>> Is there a best practice for calculating daily, weekly, monthly, quarterly, and yearly
>> active users?
>> 
>> One approach is to build a bitmap per day and later aggregate the daily bitmaps by period.
>> However, I was wondering if anyone has a better approach to tackling this problem.
>> 
>> -- 
>> Regards,
>> 
>> Rishi Shah
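
For readers following the thread, the daily-bitmap approach raised in the quoted question can be sketched in plain Python. This is a minimal illustration under stated assumptions, not anyone's production code: the helper names `build_daily_bitmaps` and `active_users`, the sample events, and the dense integer user index are all hypothetical. Python integers stand in for the bitmaps; a real deployment would more likely use a compressed bitmap library.

```python
from collections import defaultdict
from datetime import date


def build_daily_bitmaps(events):
    """Build one bitmap per day from (day, user_index) pairs.

    user_index is assumed to be a dense, non-negative integer ID.
    """
    bitmaps = defaultdict(int)
    for day, user_index in events:
        bitmaps[day] |= 1 << user_index   # set the user's bit for that day
    return dict(bitmaps)


def active_users(bitmaps, start, end):
    """Count distinct users active on any day in [start, end]."""
    combined = 0
    for day, bitmap in bitmaps.items():
        if start <= day <= end:
            combined |= bitmap            # union of the daily bitmaps
    return bin(combined).count("1")       # population count


# Hypothetical sample activity.
events = [
    (date(2019, 6, 3), 1),
    (date(2019, 6, 3), 2),
    (date(2019, 6, 4), 2),
    (date(2019, 6, 5), 3),
    (date(2019, 6, 10), 1),
    (date(2019, 6, 10), 4),
]

daily = build_daily_bitmaps(events)
dau = active_users(daily, date(2019, 6, 3), date(2019, 6, 3))    # one day
wau = active_users(daily, date(2019, 6, 3), date(2019, 6, 9))    # one week
mau = active_users(daily, date(2019, 6, 1), date(2019, 6, 30))   # one month
```

Because the union is a bitwise OR, a user active on several days in a period is counted once, and longer periods (quarter, year) reuse the same daily bitmaps.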
