spark-user mailing list archives

From Tal Grynbaum <tal.grynb...@gmail.com>
Subject Suggestions for calculating MAU/WAU/DAU
Date Sun, 28 Aug 2016 18:48:26 GMT
Hi

I'm struggling with the following problem:
I need to build a cube with 6 dimensions of app usage,
for example:
-------+------+------+------+------+------
 user  | app  |  d3  |  d4  |  d5  |  d6
-------+------+------+------+------+------
  u1   |  a1  |  x   |  y   |  z   |  5
-------+------+------+------+------+------
  u2   |  a1  |  a   |  b   |  c   |  6
-------+------+------+------+------+------

The dimension combinations generate ~100M rows daily.
For each row, I need to calculate the unique monthly, weekly, and daily
active users (MAU/WAU/DAU), along with some other metrics that can simply
be summed.

I can load the data of the last 30 days, each day, and calculate a cube
with countDistinct('userId),
but this requires a huge cluster and is quite expensive.
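In plain terms, the exact approach computes, per cube cell, a distinct-user count over a sliding window of days. Here is a toy sketch of that per-cell computation in plain Python, standing in for the Spark cube; the data, the `active_users` helper, and grouping by app only are all illustrative simplifications:

```python
from collections import defaultdict

# Hypothetical toy events: (day, user, app), standing in for the
# ~100M daily dimension-combination rows described above.
events = [
    (1, "u1", "a1"), (1, "u2", "a1"),
    (2, "u1", "a1"), (2, "u3", "a2"),
    (3, "u2", "a1"),
]

def active_users(events, day, window):
    """Distinct users per app over the `window` days ending at `day` --
    the same quantity countDistinct('userId) yields per cube cell."""
    seen = defaultdict(set)
    for d, user, app in events:
        if day - window < d <= day:
            seen[app].add(user)
    return {app: len(users) for app, users in seen.items()}

print(active_users(events, day=3, window=30))  # MAU per app
print(active_users(events, day=3, window=1))   # DAU per app
```

The expense the poster describes comes from having to re-scan the full 30-day window every day, because exact distinct counts cannot be updated incrementally from yesterday's result alone.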

I tried HyperLogLog: store the HLL byte array from the previous day,
deserialize it, add the users of the current day, calculate the new
distinct count, and serialize the byte array again for the next day.
However, to get 5% error accuracy with HLL, the byte array has to be 4K
long, which makes the 100M rows ~4000 times bigger, and I ended up
requiring a lot more resources.
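For readers unfamiliar with the technique, here is a minimal, illustrative pure-Python HyperLogLog, not the implementation the poster used. It keeps m = 2^p one-byte registers (so p = 12 gives a 4 KB byte array, matching the size mentioned above; the standard error is roughly 1.04 / sqrt(m)), and it shows the serialize / deserialize / merge cycle the poster describes:

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch: 2**p one-byte registers.
    Standard error is roughly 1.04 / sqrt(2**p)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)  # 4 KB byte array when p = 12

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: register-wise maximum. This is what lets
        # yesterday's sketch absorb today's users without a full rescan.
        for i in range(self.m):
            if other.registers[i] > self.registers[i]:
                self.registers[i] = other.registers[i]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)  # small-range correction
        return int(est)

    def to_bytes(self):
        return bytes(self.registers)

    @classmethod
    def from_bytes(cls, data, p=12):
        h = cls(p)
        h.registers = bytearray(data)
        return h
```

The blow-up the poster hits follows directly from this layout: the 4 KB register array is a fixed cost per sketch, paid for every one of the ~100M cube rows regardless of how few users that row actually saw.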

I wonder if any of you can think of a better solution.

Thanks
Tal





-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

        mobile retention done right
