Hi Arush,

Thank you for answering!

When you say checkpoints hold metadata and data, what is the data? Is it the data that is pulled from the input source, or is it the state? If it is the state, is it the same number of records that I have aggregated since the beginning, or only a subset of them? How can I limit the size of the state that is kept in a checkpoint?

Thank you
-Binh

On Tue, Mar 17, 2015 at 11:47 PM, Arush Kharbanda <firstname.lastname@example.org> wrote:

Hi

Yes, Spark Streaming is capable of stateful stream processing. Stateful versus stateless is one way of classifying streaming computations. Checkpoints hold metadata and data.

Thanks

--

On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van <email@example.com> wrote:

Hi all,

I am new to Spark, so please forgive me if my question is stupid. I am trying to use Spark Streaming in an application that reads data from a queue (Kafka), does some aggregation (sum, count, ...), and then persists the result to an external storage system (MySQL, VoltDB, ...).

From my understanding of Spark Streaming, I have two ways of doing aggregation:
- Stateless: I don't keep any state and just apply new delta values to the external system. My understanding is that this way I may end up over-counting when there is a failure and a replay.
- Stateful: Use checkpointing to keep state and blindly save the new state to the external system. This way I get correct aggregation results, but I have to keep the data in two places (the state and the external system).

My questions are:
- Is my understanding of stateless and stateful aggregation correct? If not, please correct me!
- For stateful aggregation, what does Spark Streaming keep when it saves a checkpoint?

Please kindly help!

Thanks
-Binh
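
The over-counting risk of the stateless approach, and why the stateful approach's overwrite is safe to repeat, can be sketched with a small simulation. This is plain Python illustrating the semantics only, not Spark code; the batch contents and key names are made up for illustration.

```python
# Simulate the two write strategies when a batch is replayed after a failure.
# (Illustration only -- not Spark code. Data and names are hypothetical.)

def apply_stateless(external, batch):
    """Stateless: add each batch's delta directly to the external store.
    Replaying the same batch adds the delta twice."""
    for key, delta in batch.items():
        external[key] = external.get(key, 0) + delta

def apply_stateful(external, state, batch):
    """Stateful: fold the batch into checkpointed state, then overwrite
    the external store with the full totals (an idempotent write)."""
    for key, delta in batch.items():
        state[key] = state.get(key, 0) + delta
    for key, total in state.items():
        external[key] = total  # same result no matter how often it runs

batch = {"user_a": 3, "user_b": 1}

# Stateless: replaying the batch double-counts.
ext1 = {}
apply_stateless(ext1, batch)
apply_stateless(ext1, batch)  # replay after a failure
# ext1 is now {"user_a": 6, "user_b": 2} -- over-counted

# Stateful: the external write is an overwrite of the full state, so
# repeating just the save step (e.g. after a failed write) is harmless.
ext2, state = {}, {}
apply_stateful(ext2, state, batch)
for key, total in state.items():  # retry the external write
    ext2[key] = total
# ext2 is still {"user_a": 3, "user_b": 1}
```

The catch, as the question notes, is that the stateful version keeps the running totals in two places: in the checkpointed state and in the external store.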
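
On limiting the size of checkpointed state: in Spark Streaming, the state kept by `updateStateByKey` is controlled by the update function you supply, which returns `Option[S]` in Scala; returning `None` drops that key from the state. Below is a plain-Python sketch of that eviction idea -- the function names and the eviction rule (a cap of 100) are hypothetical, chosen only to show the mechanism, not the real Spark API.

```python
# Sketch of updateStateByKey-style semantics with per-key eviction.
# (Illustration only; in real Spark the update function returns
#  Option[S], and returning None removes the key from state.)

def update_state(state, batches, update_fn):
    """Apply update_fn(new_values, old_state) for every key seen in
    either the old state or the new batch; drop keys for which it
    returns None, which is what keeps the state bounded."""
    keys = set(state) | set(batches)
    new_state = {}
    for key in keys:
        result = update_fn(batches.get(key, []), state.get(key))
        if result is not None:  # None => evict this key from state
            new_state[key] = result
    return new_state

def sum_with_cap(new_values, old):
    """Hypothetical rule: keep a running sum, but evict counters that
    grow past 100 (e.g. once they have been flushed to external storage)."""
    total = (old or 0) + sum(new_values)
    return None if total > 100 else total

state = {}
state = update_state(state, {"a": [5], "b": [120]}, sum_with_cap)
# "b" exceeded the cap and was evicted; only "a" remains in state
```

So the answer to "how can I limit the state" is: have the update function evict keys you no longer need (already flushed, expired by time, etc.), and only the surviving keys are written to the checkpoint.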