Hi,

I'm struggling with the following issue.
I need to build a cube with 6 dimensions for app usage.
For example:
-------+-------+------+------+------+------
 user  |  app  |  d3  |  d4  |  d5  |  d6
-------+-------+------+------+------+------
 u1    |  a1   |  x   |  y   |  z   |  5
-------+-------+------+------+------+------
 u2    |  a1   |  a   |  b   |  c   |  6
-------+-------+------+------+------+------

The dimension combinations generate ~100M rows daily.
For each row, I need to calculate the unique monthly active users, weekly active users, and daily active users, along with some other data (that can simply be added up).

Each day, I can load the data of the last 30 days and calculate a cube with countDistinct('userId),
but this requires a huge cluster and is quite expensive.
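
For illustration, the brute-force computation can be sketched in plain Python as follows. This is not the actual Spark job: the names DIMS, cube_keys and active_users, the integer 'day' field, and the two-dimension subset are assumptions made to keep the example small.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical subset of the six dimensions, to keep the example small.
DIMS = ("app", "d3")
WINDOWS = (1, 7, 30)  # daily, weekly, monthly windows in days


def cube_keys(row):
    """Yield every grouping key of the cube for one row.

    Each dimension is either kept or rolled up (represented as None).
    """
    for k in range(len(DIMS) + 1):
        for kept in combinations(DIMS, k):
            yield tuple(row[d] if d in kept else None for d in DIMS)


def active_users(events, today):
    """Return {cube key: (DAU, WAU, MAU)} as of `today`, using exact sets.

    `events` is an iterable of dicts with 'day' (int), 'user' and the DIMS columns.
    """
    seen = defaultdict(lambda: tuple(set() for _ in WINDOWS))
    for e in events:
        age = today - e["day"]
        if age < 0 or age >= max(WINDOWS):
            continue  # outside the trailing 30-day window
        for key in cube_keys(e):
            for users, window in zip(seen[key], WINDOWS):
                if age < window:
                    users.add(e["user"])
    return {key: tuple(len(s) for s in sets) for key, sets in seen.items()}


# Made-up sample data, only to show the shape of the input.
events = [
    {"day": 30, "user": "u1", "app": "a1", "d3": "x"},
    {"day": 30, "user": "u2", "app": "a1", "d3": "a"},
    {"day": 25, "user": "u3", "app": "a1", "d3": "x"},
    {"day": 5, "user": "u1", "app": "a1", "d3": "x"},
    {"day": 0, "user": "u4", "app": "a1", "d3": "x"},
]
result = active_users(events, today=30)
```

Every key has to keep the full set of user ids seen in the window, which is why the whole 30 days must be reloaded and why the exact approach is so heavy.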

I tried to use HyperLogLog (HLL): store the byte array of the previous day's HLL, deserialize it, add the users of the current day, calculate the new distinct count, and serialize the byte array for the next day.
However, to get 5% error with HLL, the byte array has to be 4K long, which makes the 100M rows ~4000 times bigger, and I ended up requiring a lot more resources.
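
The daily HLL update described above can be sketched roughly like this. It is a simplified HyperLogLog in plain Python, not any specific library's implementation: it assumes one byte per register, so 4096 registers give the 4K byte array mentioned above, and the function names (new_sketch, add, estimate, daily_update) are made up for the example.

```python
import hashlib
import math

P = 12      # index bits (assumption); 2**12 = 4096 registers
M = 1 << P  # one byte per register, so the serialized sketch is 4096 bytes


def _hash64(user_id):
    """64-bit hash of a user id."""
    digest = hashlib.blake2b(user_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def new_sketch():
    return bytearray(M)


def add(sketch, user_id):
    h = _hash64(user_id)
    idx = h >> (64 - P)                      # top P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 52 bits
    rank = (64 - P) - rest.bit_length() + 1  # 1-based position of the leftmost 1-bit
    if rank > sketch[idx]:
        sketch[idx] = rank


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # linear counting for small cardinalities
    return raw


def daily_update(stored, todays_users):
    """Deserialize yesterday's bytes, add today's users, serialize again."""
    sketch = bytearray(stored)
    for user_id in todays_users:
        add(sketch, user_id)
    return bytes(sketch)
```

Under this one-byte-per-register assumption, every row of the cube carries 4096 bytes, so 100M rows come to roughly 410 GB of sketches that have to be read and rewritten each day.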

I wonder if one of you can think of a better solution.

Thanks
Tal

 



--
Tal Grynbaum / CTO & co-founder

        mobile retention done right