I have a question regarding design trade-offs and best practices. I'm working on a real-time analytics system. For simplicity, I have data with timestamps (the key) and counts (the value). I use DStreams for the real-time aspect. Tuples with the same timestamp can be spread across several RDDs, so I aggregate each RDD by timestamp and increment counters in Cassandra; this gives correct aggregation counts based on the data timestamp.
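To make this concrete, here is a minimal plain-Python sketch of that aggregate-then-increment step (not actual Spark or Cassandra code; `counters`, `process_batch`, and the batch shapes are my own illustrative names, with `counters` standing in for a Cassandra counter table):

```python
from collections import Counter, defaultdict

# Stand-in for a Cassandra counter table: data timestamp -> running count.
counters = defaultdict(int)

def process_batch(batch):
    """Aggregate one micro-batch (one RDD's worth of (ts, count) tuples)
    by data timestamp, then increment the per-timestamp counters."""
    per_ts = Counter()
    for ts, count in batch:
        per_ts[ts] += count
    for ts, total in per_ts.items():
        # Counter increments are commutative, so late tuples for an
        # early timestamp still land on the correct counter.
        counters[ts] += total

# Timestamp 100 is split across two batches, and more data for 100
# arrives late in the second batch; the totals still come out right.
process_batch([(100, 2), (101, 5), (100, 3)])
process_batch([(101, 1), (100, 4)])
```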

At the same time as the tuple aggregations are saved to Cassandra, I also show the aggregations on a chart and pass the data through some more complicated math formulas (they output DStreams) that involve updateStateByKey. These other operations are similar to a moving average in that if data comes in late, you have to recalculate all moving averages starting from the timestamp of the late tuple; such is the nature of these calculations. The results of these calculations are not saved in the db but are recalculated every time data is loaded from the db.
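A toy illustration of why a late tuple forces that recalculation (pure Python, not the actual formulas; `moving_averages` and the sample series are hypothetical):

```python
def moving_averages(series, window=3):
    """Trailing moving average over a time-sorted list of (ts, value)."""
    out, vals = [], []
    for ts, v in series:
        vals.append(v)
        win = vals[-window:]
        out.append((ts, sum(win) / len(win)))
    return out

# A tuple for ts=2 arrives late: every average from ts=2 onward changes,
# so the whole tail must be recomputed after merging the late tuple in.
before = moving_averages([(1, 2), (3, 6), (4, 8)])
after = moving_averages(sorted([(1, 2), (3, 6), (4, 8), (2, 4)]))
```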


Now, in real time I do things on a best-effort basis, but the database will always have correct aggregations (even if tuples come in late for some early timestamp, Cassandra can simply increment that timestamp's counter with the amount from the late tuple).

In real time, when tuples with the same timestamp are spread across several RDDs, I don't aggregate by tuple timestamp (because that would mean reapplying the math formulas from the timestamp of the late tuple, and that is too much overhead); instead I aggregate by RDD time, which is the system time at which the RDD is created. This is good enough for real time.
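In sketch form, the best-effort path just keys the whole micro-batch by its creation time and ignores the tuples' own timestamps (again plain Python; `aggregate_by_batch_time` is an illustrative name, with the batch time injectable for testing):

```python
import time

def aggregate_by_batch_time(batch, batch_time=None):
    """Best-effort real-time path: ignore each tuple's own timestamp and
    key the whole micro-batch by the system time at which the RDD was
    created, so late data never forces recomputation of earlier results."""
    if batch_time is None:
        batch_time = int(time.time())
    total = sum(count for _ts, count in batch)
    return (batch_time, total)

# Tuples for data timestamps 100 and 101 all land under the batch time.
result = aggregate_by_batch_time([(100, 2), (100, 3), (101, 5)], batch_time=1234)
```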


So now you can see that the db (the source of truth) can differ from the real-time streaming results (best effort).


My questions:

1. In your experience, is the design I just described appropriate?

2. I'm curious how others have solved the problem of reconciling differences between their real-time processing and batch mode. I think I read on the mailing list (several months ago) that someone re-does the aggregation step an hour after the data is received (i.e. the aggregation DStream job always runs an hour behind, so late tuples have time to propagate to the db).

3. If the source of data fails and is restarted later, my design will produce duplicates unless the tuples already in the database are deleted for the timestamps covered by the data I am re-streaming. Is there a better way to avoid duplicates when running the same job twice, or as part of a bigger job? (Idempotency.)
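To make question 3 concrete, here is a sketch of one possible idempotent alternative to bare counter increments (plain Python; `partials`, `write_batch`, `count_for`, and the batch ids are hypothetical names, not an endorsement of a specific Cassandra schema): key each partial aggregate by (data timestamp, batch id), so a replayed batch overwrites its own row instead of incrementing again.

```python
from collections import Counter

# (data timestamp, batch id) -> partial count for that batch.
# Overwriting a row on replay is harmless; incrementing twice is not.
partials = {}

def write_batch(batch_id, batch):
    """Aggregate one batch by timestamp, then upsert (not increment)
    one row per (timestamp, batch_id)."""
    per_ts = Counter()
    for ts, count in batch:
        per_ts[ts] += count
    for ts, total in per_ts.items():
        partials[(ts, batch_id)] = total  # overwrite, not increment

def count_for(ts):
    """True count for a timestamp: sum of its partials over all batches."""
    return sum(v for (t, _b), v in partials.items() if t == ts)

write_batch("b1", [(100, 2), (100, 3)])
write_batch("b1", [(100, 2), (100, 3)])  # replay of the same batch: no double count
write_batch("b2", [(100, 4)])            # a genuinely new batch still adds up
```

The trade-off is an extra read-side sum (or a periodic compaction of partials into totals) in exchange for replay safety.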