Hi Nikunj, 

Depending on what kind of stats you want to accumulate, you may want to look into the Accumulator/Accumulable API. If you need more control, you can store the stats in an external key-value store (HBase, Redis, etc.) and update them there. Make sure those updates are atomic (transactions or CAS semantics); otherwise concurrent tasks can overwrite each other's writes and you will run into race conditions.
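
To make the CAS point concrete, here is a rough sketch in plain Python. It does not use Spark or any real store client: VersionedStore, compare_and_set, add_to_stat and the "events_seen" key are all hypothetical names, and the in-memory class only stands in for whatever conditional-write primitive your actual store offers. The idea is to read the value together with a version, compute the new value, and write it back only if the version has not changed, retrying otherwise.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory stand-in for an external key-value store."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def get(self, key):
        # Returns (version, value); a missing key reads as version 0, value None.
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value):
        # Write only if nobody else has written since we read expected_version.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._data[key] = (current_version + 1, new_value)
            return True


def add_to_stat(store, key, delta):
    """Read-modify-write with retry: a lost race is retried, never overwritten."""
    while True:
        version, value = store.get(key)
        new_value = (value or 0) + delta
        if store.compare_and_set(key, version, new_value):
            return new_value


def demo(num_workers=8, updates_per_worker=500):
    """Several threads (standing in for concurrent tasks) bump one counter."""
    store = VersionedStore()

    def work():
        for _ in range(updates_per_worker):
            add_to_stat(store, "events_seen", 1)

    threads = [threading.Thread(target=work) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.get("events_seen")


if __name__ == "__main__":
    print(demo())
```

With a plain get followed by an unconditional put, two tasks that read the same old value would both write back and one increment would be lost. The version check is what prevents that.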


On Fri, Aug 28, 2015 at 11:39 AM N B <nb.nospam@gmail.com> wrote:
Hi all,

I have a use case in Spark Streaming that I would like some insight on.

Every batch is processed through the pipeline and, at the end, has to update some statistics. The updated stats should be reusable by the next batch of this DStream, e.g. for looking up the relevant stat, and that batch in turn refines the stats further. This has to continue for every batch processed. The first batch in the DStream can work with an empty stats lookup without issue. Essentially, we are trying to build a feedback loop.

What is a good pattern to apply for something like this? Some approaches that I considered are:

1. Use updateStateByKey(). But this produces a new DStream that I cannot refer back to earlier in the pipeline, so it seems like a no-go, though I would be happy to be proven wrong.

2. Use broadcast variables to maintain this state, in a Map for example, and keep re-broadcasting it after every batch. I am not sure whether this has performance implications or if it's even a good idea.

3. IndexedRDD? It looked promising initially, but I quickly realized that it might have the same issue as the updateStateByKey() approach, i.e. it's not available in the pipeline before it's created.

4. Any other obvious ideas that I am missing?