spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Mocanu <>
Subject reconciling best effort real time with delayed aggregation
Date Wed, 14 Jan 2015 16:19:22 GMT
I have a question regarding design trade offs and best practices. I'm working on a real time
analytics system. For simplicity, I have data with timestamps (the key) and counts (the value).
I use DStreams for the real time aspect. Tuples w the same timestamp can be across various
RDDs and I just aggregate each RDD by timestamp and increment counters in Cassandra; this
gives correct aggregation counts based on data timestamp.
At the same time as tuple aggregations as saved into Cassandra I also show the aggregations
on a chart and also pass the data through some more complicated math formulas (they output
DStreams) which involve using updateStateByKey. These other operations are similar to moving
average in the way that if data comes late you'd have to recalculate all moving averages starting
from the date of the last delayed tuple; such is the nature of these calculations. The calculations
are not saved in db but recalculated every time data is loaded from db.

Now, in real time I do things on a best effort basis, but the database will always have correct
aggregations (even if tuples come in late for some early timestamp Cassandra will easily increment
a counter w amount from this late tuple).
In real time, when a tuple w same timestamp belongs to several RDDs, I don't aggregate by
tuple timestamp (bc that would mean reapplying the math formulas from the timestamp of the
last tuple and that is too much overhead) instead I aggregate by RDD time which is system
time when the RDD is created. This is good enough for real time.

So now you can see that the db (the truth provider) is different from real time streaming
results (best effort).

My questions:
1. From your experience, is this design I just described appropriate?
2. I'm curious how others have solved the problem of reconciling diferences in their real
time processing w batch mode. I think I read on the mailig list (several months ago) that
someone re does the aggregation step an hour after data is received (ie aggregation DStream
job is always an hour behind so that way late tuples have time to propagate to the db)
3. In case the source of data fails and it is restarted later, my design will give duplicates
unless the tuples from the database are deleted for the timestamps that the data I am re-streaming
contains. Is there a better way to avoid duplicates if running the same job twice or part
of a bigger job. (idempotency)


View raw message