Hi

I'm struggling with the following issue.

I need to build a cube with 6 dimensions for app usage. For example:

+------+-----+----+----+----+----+
| user | app | d3 | d4 | d5 | d6 |
+------+-----+----+----+----+----+
| u1   | a1  | x  | y  | z  | 5  |
| u2   | a1  | a  | b  | c  | 6  |
+------+-----+----+----+----+----+

The dimension combinations generate ~100M rows daily.

For each row, I need to calculate the unique monthly, weekly, and daily active users (MAU/WAU/DAU), along with some other metrics that can simply be summed up.

Each day I can load the data of the last 30 days and compute the cube with countDistinct('userId), but this requires a huge cluster and is quite expensive.
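For reference, the rollup I'm computing is equivalent to this plain-Scala sketch over an in-memory sample (the `Event` case class and field names are made up for illustration; the real job does the same with Spark's groupBy(...).agg(countDistinct('userId)) over the 30-day window):

```scala
// Toy model of one cube pass: group events by the dimension tuple and
// count distinct users per cell.
case class Event(userId: String, app: String, d3: String, d4: String, d5: String, d6: Int)

val events = Seq(
  Event("u1", "a1", "x", "y", "z", 5),
  Event("u2", "a1", "a", "b", "c", 6),
  Event("u1", "a1", "x", "y", "z", 5) // same user twice in the same cell
)

// One cell of the cube = one combination of the dimension values.
val distinctUsersPerCell: Map[(String, String, String, String, Int), Long] =
  events
    .groupBy(e => (e.app, e.d3, e.d4, e.d5, e.d6))
    .map { case (cell, es) => cell -> es.map(_.userId).distinct.size.toLong }
```

The ("a1", "x", "y", "z", 5) cell has two events but only one distinct user, which is exactly why the counts can't simply be summed across days.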

I tried HyperLogLog: store the serialized HLL byte array from the previous day, deserialize it, add the current day's users, compute the new distinct count, and serialize the byte array again for the next day.

However, to get 5% error with HLL, the byte array has to be ~4K long, which makes the 100M rows roughly 4000 times bigger, and I ended up requiring a lot more resources.
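To make the size/accuracy trade-off concrete, here is a toy HLL in plain Scala (an illustration, not the library implementation I used): with p index bits you get m = 2^p one-byte registers, each update keeps the maximum leading-zero rank, and merging yesterday's sketch with today's is just an element-wise max. Since the standard error is roughly 1.04/sqrt(m), halving the target error quadruples the sketch size, which is why the per-row byte array dominates the storage.

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog: p index bits -> m = 2^p one-byte registers.
final class ToyHll(val p: Int) {
  val m: Int = 1 << p
  val registers: Array[Byte] = new Array[Byte](m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p)                            // first p bits pick a register
    val rank = (Integer.numberOfLeadingZeros(h << p) + 1).min(32 - p)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // Merging two sketches (e.g. yesterday's + today's) is element-wise max,
  // which is what makes the incremental daily update possible at all.
  def merge(other: ToyHll): ToyHll = {
    require(other.p == p, "sketches must have the same precision")
    val out = new ToyHll(p)
    var i = 0
    while (i < m) { out.registers(i) = registers(i).max(other.registers(i)); i += 1 }
    out
  }

  // Raw HLL estimate (no small/large-range corrections, so it's rough).
  def estimate: Double = {
    val alpha = 0.7213 / (1 + 1.079 / m)
    val z = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    alpha * m * m / z
  }
}

val day1 = new ToyHll(12)                  // 4096 registers -> ~4KB per cube row
(1 to 100000).foreach(i => day1.add("user" + i))
val day2 = new ToyHll(12)
(50001 to 150000).foreach(i => day2.add("user" + i))
val merged = day1.merge(day2)              // distinct users across both days
```

Here merged.estimate lands near the true union of 150,000, within a few percent, while storing only the 4096 registers per row.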

I wonder if one of you can think of a better solution.

Thanks

Tal