spark-issues mailing list archives

From "Tathagata Das (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-3129) Prevent data loss in Spark Streaming on driver failure
Date Tue, 21 Oct 2014 02:08:34 GMT

     [ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-3129:
---------------------------------
    Description: 
Spark Streaming can lose small amounts of data when the driver goes down and the sending system
cannot re-send the data (or the data has already expired on the sender side). This currently
affects all receivers.

The solution we propose is to reliably store all the received data into HDFS. This will allow
the data to persist through driver failures, so it can be processed when the driver
is restarted.
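As a rough illustration of the idea above (not part of the actual patch), the sketch below persists each received block to durable storage alongside the in-memory BlockManager, so a restarted driver can re-read it. A local file stands in for HDFS, and `DurableBlockStore` and its method names are hypothetical:

```scala
import java.nio.file.{Files, Paths}

// Hypothetical sketch: every received block is written to durable storage
// (HDFS in the real design; a local directory here for illustration) in
// addition to the in-memory BlockManager.
object DurableBlockStore {
  // Persist a block's bytes under a stable block id.
  def store(dir: String, blockId: String, data: Array[Byte]): Unit = {
    Files.createDirectories(Paths.get(dir))
    Files.write(Paths.get(dir, blockId), data)
  }

  // After a driver restart, recover the bytes of a previously stored block,
  // or None if the block was never persisted.
  def recover(dir: String, blockId: String): Option[Array[Byte]] = {
    val p = Paths.get(dir, blockId)
    if (Files.exists(p)) Some(Files.readAllBytes(p)) else None
  }
}
```

In the real design the durable copy, not the sender, becomes the source of truth for replay after a failure.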

The high-level design doc for this feature is available here:
https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit?usp=sharing

This major task has been divided into sub-tasks:
- Implementing a write-ahead log management system that can manage rolling write-ahead logs
- write to the log, recover on failure, and clean up old logs
- Implementing an HDFS-backed block RDD that can read data either from Spark's BlockManager
or from HDFS files
- Implementing a ReceivedBlockHandler interface that abstracts out the functionality of saving
received blocks
- Implementing a ReceivedBlockTracker and other associated changes in the driver that allow
the metadata of received blocks and block-to-batch allocations to be recovered upon driver restart
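To make the first sub-task concrete, here is a minimal sketch (not the actual Spark implementation) of a rolling write-ahead log: records are length-prefixed and appended to numbered segment files, recovery replays every surviving segment in order, and cleanup deletes segments older than a threshold. The class name, segment naming, and roll policy are all illustrative assumptions:

```scala
import java.io.{DataInputStream, DataOutputStream, EOFException, File, FileInputStream, FileOutputStream}

// Illustrative rolling write-ahead log. Each segment file holds at most
// maxRecordsPerSegment length-prefixed records.
class RollingWriteAheadLog(dir: File, maxRecordsPerSegment: Int = 2) {
  dir.mkdirs()
  private var segmentIndex = 0
  private var recordsInSegment = 0

  private def segmentFile(i: Int) = new File(dir, f"log-$i%05d")

  // Append one record, rolling to a new segment file when the current one is full.
  def write(record: Array[Byte]): Unit = {
    if (recordsInSegment >= maxRecordsPerSegment) { segmentIndex += 1; recordsInSegment = 0 }
    val out = new DataOutputStream(new FileOutputStream(segmentFile(segmentIndex), true))
    try { out.writeInt(record.length); out.write(record) } finally out.close()
    recordsInSegment += 1
  }

  // Replay all records from all surviving segments, in write order.
  def recover(): Seq[Array[Byte]] = {
    val segments = Option(dir.listFiles()).getOrElse(Array.empty[File]).sortBy(_.getName)
    segments.toSeq.flatMap { f =>
      val in = new DataInputStream(new FileInputStream(f))
      val records = Iterator.continually {
        try {
          val len = in.readInt()
          val buf = new Array[Byte](len)
          in.readFully(buf)
          Some(buf)
        } catch { case _: EOFException => None }
      }.takeWhile(_.isDefined).flatten.toList
      in.close()
      records
    }
  }

  // Clean up old logs: delete segments strictly older than the given index.
  def cleanOlderThan(minIndex: Int): Unit =
    (0 until minIndex).foreach(i => segmentFile(i).delete())
}
```

The real system would additionally have to fsync/flush to HDFS before acknowledging a block, which this sketch omits.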

  was:Spark Streaming can lose small amounts of data when the driver goes down and the sending
system cannot re-send the data (or the data has already expired on the sender side). The attached
document has more details.


> Prevent data loss in Spark Streaming on driver failure
> ------------------------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>    Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.0.3
>            Reporter: Hari Shreedharan
>            Assignee: Tathagata Das
>            Priority: Critical
>         Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

