spark-dev mailing list archives

From "Shao, Saisai"
Subject RE: StreamingContext textFileStream question
Date Mon, 23 Feb 2015 19:17:05 GMT
Hi Mark,

For file-based input streams such as the text file stream, only the RDDs can be recovered
from the checkpoint, not files that were missed. If a file is missing, an exception will
actually be raised. If you use HDFS, HDFS will guarantee no data loss since it keeps 3
copies. Otherwise, user logic has to guarantee that no file is deleted before recovery.
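
As a rough illustration of what such user logic might look like, here is a minimal sketch
that records how far processing has got and, after a restart, lists the files that arrived
since then. It is not part of Spark: the function names and the marker-file approach are
assumptions made up for this example, and it only handles a local directory (an HDFS
directory would need an HDFS client instead of `os`).

```python
import os
import tempfile


def save_marker(marker_path, timestamp):
    """Persist the modification time up to which files have been processed.

    Hypothetical helper: writes to a temporary file first and renames it,
    so a crash never leaves a half-written marker behind.
    """
    directory = os.path.dirname(os.path.abspath(marker_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(repr(float(timestamp)))
    os.replace(tmp_path, marker_path)


def load_marker(marker_path):
    """Return the saved timestamp, or 0.0 if no marker has been written yet."""
    try:
        with open(marker_path) as f:
            return float(f.read())
    except FileNotFoundError:
        return 0.0


def missed_files(input_dir, marker_path):
    """List files in input_dir modified after the saved marker, oldest first.

    Keep the marker file outside input_dir, otherwise it is listed as well.
    """
    since = load_marker(marker_path)
    found = []
    for entry in os.scandir(input_dir):
        if entry.is_file():
            mtime = entry.stat().st_mtime
            if mtime > since:
                found.append((mtime, entry.path))
    return [path for _, path in sorted(found)]
```

The job would call `save_marker` after each batch completes and `missed_files` once on
startup. This only helps if the files are still present at restart, which is the guarantee
mentioned above.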

For receiver-based input streams, like the Kafka input stream or the socket input stream,
a WAL (write ahead log) mechanism can be enabled to store the received data as well as the
metadata, so the data can be recovered after a failure.
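
For reference, a minimal configuration sketch for turning the WAL on, assuming Spark 1.2 or
later and assuming the setting goes in spark-defaults.conf (it can equally be set on the
SparkConf). The write ahead log is off by default, and it also needs a checkpoint directory
set on the StreamingContext, on a fault-tolerant file system such as HDFS.

```
# Store data received by receivers in a write ahead log before it is processed
spark.streaming.receiver.writeAheadLog.enable   true
```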


-----Original Message-----
From: mkhaitman
Sent: Monday, February 23, 2015 10:54 AM
Subject: StreamingContext textFileStream question


I was interested in creating a StreamingContext textFileStream based job, which runs for long
durations, and can also recover from prolonged driver failure... It seems like StreamingContext
checkpointing is mainly used for the case when the driver dies during the processing of an
RDD, and to recover that one RDD, but my question specifically relates to whether there is
a way to also recover which files were missed between the timeframe of the driver dying and
being started back up (whether manually or automatically).

Any assistance/suggestions with this one would be greatly appreciated!

