spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chen Song <chen.song...@gmail.com>
Subject Re: spark streaming questions
Date Wed, 25 Jun 2014 19:09:02 GMT
Thanks Anwar.


On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal <anrizal05@gmail.com> wrote:

>
> On Tue, Jun 17, 2014 at 5:39 PM, Chen Song <chen.song.82@gmail.com> wrote:
>
>> Hey
>>
>> I am new to spark streaming and apologize if these questions have been
>> asked.
>>
>> * In StreamingContext, reduceByKey() seems to only work on the RDDs of
>> the current batch interval, not including RDDs of previous batches. Is my
>> understanding correct?
>>
>
> It's correct.
>
>
>>
>> * If the above statement is correct, what functions to use if one wants
>> to do processing on the continuous stream batches of data? I see 2
>> functions, reduceByKeyAndWindow and updateStateByKey which serve this
>> purpose.
>>
>
> I presume that you need to keep a state that goes beyond one batch, so
> multiple batches. In this case, yes, updateStateByKey is the one you will
> use. Basically, updateStateByKey wraps a state into an RDD.
>
>
>
>
>>
>> My use case is an aggregation and doesn't fit a windowing scenario.
>>
>> * As for updateStateByKey, I have a few questions.
>> ** Over time, will spark stage original data somewhere to replay in case
>> of failures? Say the Spark job run for weeks, I am wondering how that
>> sustains?
>> ** Say my reduce key space is partitioned by some date field and I would
>> like to stop processing old dates after a period time (this is not a simply
>> windowing scenario as which date the data belongs to is not the same thing
>> when the data arrives). How can I handle this to tell spark to discard data
>> for old dates?
>>
>
> You will need to call checkpoint (see
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#rdd-checkpointing)
>  that will persist the metadata of RDD that will consume memory (and stack
> execution) otherwise. You can set the interval of checkpointing that suits
> your need.
>
> Now, if you want to also reset your state after some times, there is no
> immediate way I can think of ,but you can do it through updateStateByKey,
> maybe by book-keeping the timestamp.
>
>
>
>>
>> Thank you,
>>
>> Best
>> Chen
>>
>>
>>
>


-- 
Chen Song

Mime
View raw message