spark-issues mailing list archives

From "Cody Koeninger (JIRA)" <>
Subject [jira] [Closed] (SPARK-9947) Separate Metadata and State Checkpoint Data
Date Wed, 12 Oct 2016 23:23:21 GMT


Cody Koeninger closed SPARK-9947.
    Resolution: Won't Fix

The direct DStream API already gives access to offsets, and it seems clear that most future
work on streaming checkpointing will be focused on Structured Streaming (SPARK-15406).
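The offset access mentioned above can be sketched as follows, assuming the spark-streaming-kafka-0-10 integration; `trackOffsets` is an illustrative name, not from the thread. Each RDD produced by a direct stream implements `HasOffsetRanges`, so offsets can be stored or committed externally instead of living only in Spark's checkpoint metadata:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch: read the Kafka offset ranges carried by each batch of a direct stream.
def trackOffsets(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
  stream.foreachRDD { rdd =>
    // Only the RDDs produced directly by the stream carry offset ranges.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetRanges.foreach { o =>
      println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
    }
    // Optionally commit back to Kafka once the batch's output is safely stored.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
}
```

Because the application manages offsets itself, the checkpoint directory can be deleted on redeploy without losing the stream position.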

> Separate Metadata and State Checkpoint Data
> -------------------------------------------
>                 Key: SPARK-9947
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.4.1
>            Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> Problem: When updating an application that has checkpointing enabled to support updateStateByKey
and 24/7 operation, you run into the problem that you may want to keep the state data across
restarts but delete the metadata that records execution state.
> If checkpoint data persists across a code redeployment, the program may not execute properly,
or at all. My current workaround is to wrap updateStateByKey in my own function that persists
the state after every update to a separate directory. (That allows me to delete the checkpoint,
with its metadata, before redeploying.) Then, when I restart the application, I initialize the
state from this persisted data. This incurs additional overhead because the same data is persisted
twice: once in the checkpoint and once in my own state directory.
> If the Kafka Direct API offsets could also be stored in a separate checkpoint directory,
that would likewise avoid having to blow the checkpoint away between code redeployments.
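The workaround described in the report can be sketched as below, assuming the classic DStream API; `statefulCounts` and `stateDir` are illustrative names, not from the original report, and the snapshot path handling is simplified (a real job would pick the latest snapshot and handle the first run, when no snapshot exists):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Sketch: keep updateStateByKey state in a user-managed directory, separate
// from the Spark checkpoint, so the checkpoint can be deleted on redeploy.
def statefulCounts(
    ssc: StreamingContext,
    events: DStream[(String, Int)],
    stateDir: String): DStream[(String, Int)] = {

  // On restart, seed the state from the separately persisted snapshot so the
  // checkpoint directory (with its stale metadata) can be deleted safely.
  val initial: RDD[(String, Int)] =
    ssc.sparkContext.objectFile[(String, Int)](stateDir)

  val updated = events.updateStateByKey(
    (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0)),
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initial)

  // Persist the full state after every batch to the user-managed directory.
  // This duplicates what the checkpoint already stores -- the overhead the
  // report mentions.
  updated.foreachRDD { (rdd, time) =>
    rdd.saveAsObjectFile(s"$stateDir-${time.milliseconds}")
  }
  updated
}
```

This is exactly the double-write the reporter describes: the state is written once into Spark's checkpoint and once into the snapshot directory, trading storage and I/O for the ability to discard the checkpoint's metadata on redeploy.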

This message was sent by Atlassian JIRA

