spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anwar Rizal <>
Subject Re: spark streaming questions
Date Tue, 17 Jun 2014 15:54:26 GMT
On Tue, Jun 17, 2014 at 5:39 PM, Chen Song <> wrote:

> Hey
> I am new to spark streaming and apologize if these questions have been
> asked.
> * In StreamingContext, reduceByKey() seems to only work on the RDDs of the
> current batch interval, not including RDDs of previous batches. Is my
> understanding correct?

It's correct.

> * If the above statement is correct, what functions to use if one wants to
> do processing on the continuous stream batches of data? I see 2 functions,
> reduceByKeyAndWindow and updateStateByKey which serve this purpose.

I presume that you need to keep a state that goes beyond one batch, so
multiple batches. In this case, yes, updateStateByKey is the one you will
use. Basically, updateStateByKey wraps a state into an RDD.

> My use case is an aggregation and doesn't fit a windowing scenario.
> * As for updateStateByKey, I have a few questions.
> ** Over time, will spark stage original data somewhere to replay in case
> of failures? Say the Spark job run for weeks, I am wondering how that
> sustains?
> ** Say my reduce key space is partitioned by some date field and I would
> like to stop processing old dates after a period time (this is not a simply
> windowing scenario as which date the data belongs to is not the same thing
> when the data arrives). How can I handle this to tell spark to discard data
> for old dates?

You will need to call checkpoint (see
 that will persist the metadata of RDD that will consume memory (and stack
execution) otherwise. You can set the interval of checkpointing that suits
your need.

Now, if you want to also reset your state after some times, there is no
immediate way I can think of ,but you can do it through updateStateByKey,
maybe by book-keeping the timestamp.

> Thank you,
> Best
> Chen

View raw message