spark-user mailing list archives

From Arush Kharbanda <ar...@sigmoidanalytics.com>
Subject Re: Idempotent count
Date Wed, 18 Mar 2015 07:59:27 GMT
Hi Binh,

It stores the state as well as the unprocessed data. The state is a subset
of the records that you have aggregated so far.

This provides a good reference for checkpointing.

http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing
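
To make the trade-off from the original question (quoted below) concrete, here
is a minimal sketch in plain Python. It does not use any Spark API: the
function names, the `deliveries` list, and the dict standing in for the
external store are all hypothetical, and the checkpoint is modelled as a simple
snapshot of the running state taken before each batch. It only illustrates why
applying deltas over-counts on replay while overwriting with restored state
does not.

```python
from collections import Counter


def run_stateless(deliveries):
    """Apply per-batch deltas to the store; a replayed batch is added twice."""
    store = {}  # stand-in for the external system (MySQL, VoltDB, ...)
    for _batch_id, records in deliveries:
        for key, delta in Counter(records).items():
            store[key] = store.get(key, 0) + delta
    return store


def run_stateful(deliveries):
    """Keep running totals as state and overwrite the store with them.

    A snapshot of the state is taken before each new batch (the assumed
    "checkpoint"). When a batch id shows up again, the state is restored
    from that snapshot before the batch is reprocessed.
    """
    store = {}
    state = {}
    checkpoints = {}  # batch_id -> state as it was before that batch
    for batch_id, records in deliveries:
        if batch_id in checkpoints:
            state = dict(checkpoints[batch_id])  # recovery: roll back
        else:
            checkpoints[batch_id] = dict(state)
        for key, n in Counter(records).items():
            state[key] = state.get(key, 0) + n
        for key, total in state.items():
            store[key] = total  # idempotent overwrite, not an increment
    return store


# Batch 2 is delivered twice to simulate a failure followed by a replay.
deliveries = [
    (1, ["a", "b", "a"]),
    (2, ["a", "c"]),
    (2, ["a", "c"]),
    (3, ["b"]),
]

print(run_stateless(deliveries))
print(run_stateful(deliveries))
```

The true counts here are a=3, b=2, c=1. The stateless run counts batch 2
twice, while the stateful run writes the same totals again on replay, so the
store ends up correct at the cost of keeping the totals in two places.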


On Wed, Mar 18, 2015 at 12:52 PM, Binh Nguyen Van <binhnv80@gmail.com>
wrote:

> Hi Arush,
>
> Thank you for answering!
> When you say checkpoints hold metadata and data, what is the data? Is it
> the data that is pulled from the input source, or is it the state?
> If it is the state, is it the same number of records that I have aggregated
> since the beginning, or only a subset of them? How can I limit the size of
> the state that is kept in the checkpoint?
>
> Thank you
> -Binh
>
> On Tue, Mar 17, 2015 at 11:47 PM, Arush Kharbanda <
> arush@sigmoidanalytics.com> wrote:
>
>> Hi
>>
>> Yes, Spark Streaming is capable of stateful stream processing. Stateful
>> versus stateless is a way of classifying stream processing.
>> Checkpoints hold metadata and data.
>>
>> Thanks
>>
>>
>> On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van <binhnv80@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am new to Spark, so please forgive me if my questions are stupid.
>>> I am trying to use Spark Streaming in an application that reads data
>>> from a queue (Kafka), does some aggregation (sum, count, ...), and
>>> then persists the result to an external storage system (MySQL, VoltDB, ...).
>>>
>>> From my understanding of Spark-Streaming, I can have two ways
>>> of doing aggregation:
>>>
>>>    - Stateless: I don't keep state and just apply new delta values to
>>>    the external system. As I understand it, doing it this way I may end
>>>    up over-counting when there is a failure and replay.
>>>    - Stateful: Use checkpoints to keep state and blindly save the new
>>>    state to the external system. Doing it this way I get a correct
>>>    aggregation result, but I have to keep data in two places (state and
>>>    external system).
>>>
>>> My questions are:
>>>
>>>    - Is my understanding of stateless and stateful aggregation
>>>    correct? If not, please correct me!
>>>    - For stateful aggregation, what does Spark Streaming keep when
>>>    it saves a checkpoint?
>>>
>>> Please kindly help!
>>>
>>> Thanks
>>> -Binh
>>>
>>
>>
>>
>> --
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> arush@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>


-- 

*Arush Kharbanda* || Technical Teamlead

arush@sigmoidanalytics.com || www.sigmoidanalytics.com
