spark-user mailing list archives

From Diana Carroll <>
Subject persistence and fault tolerance in Spark Streaming
Date Wed, 28 May 2014 17:45:47 GMT
As I understand it, Spark Streaming automatically persists windowed
DStreams (with replication 2), but not regular DStreams.

So my question is: what happens in the case of worker node failure with a
non-windowed DStream whose data source is a network stream?

Say I'm getting a feed of log data, and one of my workers drops out halfway
through an operation.  What happens?  Non-streaming RDDs are resilient
because they can be recomputed from the source file, but in this case there
is no source file.  If the original data from the stream wasn't replicated,
does that mean it's just lost?  Will the task just fail?  Will the job fail?

Also, I tried testing on a cluster with two workers and a windowed DStream.
The "Storage" tab in the app UI does show the data being persisted, but
only with single replication.  Is that because my cluster is too small?
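For context, this is roughly what I'm doing. A minimal Scala sketch (the
socket source and window sizes are just placeholders, and I'm assuming an
already-created StreamingContext `ssc`); as I understand the API, the
receiver's storage level can be set explicitly when creating the input
stream, and a DStream can also be re-persisted at a chosen level:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds

// Request 2x-replicated storage for the received data itself
// (placeholder host/port; MEMORY_AND_DISK_SER_2 replicates to 2 nodes).
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER_2)

// Windowed DStreams are persisted automatically, but persist()
// can be called to request an explicit storage level.
val windowed = lines.window(Seconds(30), Seconds(10))
windowed.persist(StorageLevel.MEMORY_ONLY_SER_2)
```

Even with something like the above, the "Storage" tab still reports the
replication I described.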


