@TD: Could you provide some guidance on this? These same types of questions come up a lot in the field, and I'd like to have a solid answer for folks.

Thanks so much!


On Wed, May 28, 2014 at 10:45 AM, Diana Carroll <dcarroll@cloudera.com> wrote:
As I understand it, Spark Streaming automatically persists windowed DStreams (with replication = 2), but not regular DStreams.

So my question is: what happens in the case of a worker node failure with a non-windowed DStream whose data source is a network stream?

Say I'm getting a feed of log data, and one of my workers drops out halfway through an operation.  What happens?  Non-streaming RDDs are resilient because they can be recomputed from the source file, but in this case there is no source file.  If the original data from the stream wasn't replicated, does that mean it's just lost?  Will the task just fail?  Will the job fail?

Also, I tried testing on a cluster with two workers and a windowed DStream. The "Storage" tab in the app UI does show the data being persisted, but only with single replication. Is that because my cluster is too small?
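
To make the setup in question concrete, here is a minimal Scala sketch of a network-sourced stream with a windowed DStream, where the storage levels are passed explicitly instead of being left to defaults. It is only an illustration under stated assumptions: the host name, port, batch interval, and window sizes are hypothetical placeholders, and this is not the code from the test described above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReplicatedStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReplicatedStreamSketch")
    // Assumed batch interval of 1 second.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Network source. The third argument is the storage level for the
    // received blocks; the "_2" suffix asks for two replicas.
    // Host and port are placeholders.
    val lines = ssc.socketTextStream(
      "loghost.example.com", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Windowed DStream (assumed 30 s window, sliding every 10 s),
    // persisted with an explicitly replicated storage level.
    val windowed = lines.window(Seconds(30), Seconds(10))
    windowed.persist(StorageLevel.MEMORY_ONLY_SER_2)

    // An output operation is needed for anything to run.
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Setting the levels explicitly makes it easier to compare what was requested with what the "Storage" tab reports, which may help separate a configuration question from a cluster-size question.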
