spark-user mailing list archives

From Massimiliano Tomassi <>
Subject Re: Spark Streaming Fault Tolerance (?)
Date Thu, 09 Oct 2014 20:37:20 GMT
Hello all,
I wrote a blog post around the issue I reported before:

Can I ask for some feedback from those of you who are already using Spark
Streaming in production? How do you deal with fault tolerance and scalability?
Thanks a lot for your help.


On 7 October 2014 18:17, Massimiliano Tomassi <> wrote:

> Reading the Spark Streaming Programming Guide I found a couple of
> interesting points. First of all, while talking about receivers, it says:
> *"If the number of cores allocated to the application is less than or
> equal to the number of input DStreams / receivers, then the system will
> receive data, but not be able to process them."*
> Then, when talking about fault tolerance it says:
> *“However, if the worker node where a network receiver was running fails,
> then a tiny bit of data may be lost, that is, the data received by the
> system but not yet replicated to other node(s). The receiver will be
> started on a different node and it will continue to receive data.”*
> So I asked myself: what happens if I have 2 workers with 2 cores each and
> one worker dies? From what I've reported above, my answer would be: the
> receiver of the dead worker will be moved to the other worker, so there
> will be 2 receivers on the same worker. But that worker only has 2 cores, so
> it won't be able to process batches anymore. *Is this possible?*
> Well, I actually tried it: I had 2 workers receiving from Kafka and
> processing RDDs properly. I killed one of the workers and observed the
> behaviour in the Spark web UI (port 4040). In the Streaming tab there are
> still 2 active receivers, both allocated to the only living worker. But the
> "Processed batches" number is stuck, evidence that no batches have been
> processed since the worker died. Also, given that the receivers are still
> active, they are still updating the Kafka offsets in Zookeeper, meaning that
> those messages are now lost, unless you replay them by resetting the
> offsets properly (but where would you start from?).
> Right, this was my test. I still hope I'm wrong, but does this mean that
> the number of workers needs to be decided at the beginning (based on the
> number of cores available), with no option to scale the cluster later if
> needed? I mean, I could use 2 workers with 3 cores each, but what if I want
> to add a new worker after a while?
> Looking forward to hearing your feedback; I suppose this is a pretty
> important topic to get right.
> Thanks a lot,
> Max
> --
> ------------------------------------------------
> Massimiliano Tomassi
> ------------------------------------------------
> web:
> e-mail:
> ------------------------------------------------
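
To make the core/receiver arithmetic in the quoted message concrete, here is
roughly the kind of setup I mean. This is only a minimal sketch using the
receiver-based KafkaUtils.createStream API from spark-streaming-kafka; the
ZooKeeper host, topic and consumer group id are made up. With 2 receivers the
application needs strictly more than 2 cores, otherwise the receivers occupy
every core and no batches can be processed:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverCoresSketch {
  def main(args: Array[String]): Unit = {
    val numReceivers = 2

    // The app needs more cores than receivers: here 2 receivers plus at
    // least 1 core left over for processing the batches. On a real cluster
    // the same rule applies to the total number of executor cores.
    val conf = new SparkConf()
      .setAppName("receiver-cores-sketch")
      .setMaster("local[3]")

    val ssc = new StreamingContext(conf, Seconds(10))

    // One receiver-based Kafka input DStream per receiver, then union them.
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(
        ssc,
        "zk-host:2181",        // made-up ZooKeeper quorum
        "my-consumer-group",   // made-up consumer group id
        Map("my-topic" -> 1),  // made-up topic -> threads per receiver
        StorageLevel.MEMORY_AND_DISK_SER_2)
    }
    val messages = ssc.union(streams)

    // Just something that forces every batch to be processed.
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The worry described above is exactly that, after one worker dies, the
surviving worker may end up hosting both receivers and then has no core left
for processing.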
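
And on the "where would you start from?" question: the only workaround I can
think of with the receiver-based approach is to replay the topic from the
beginning under a fresh consumer group, since the offsets committed to
ZooKeeper don't tell you which messages were actually processed. A rough
sketch (again, the ZooKeeper host, topic and group id are invented):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Re-consume a topic from the earliest retained message by using a consumer
// group with no committed offsets and auto.offset.reset set to "smallest".
def replayFromBeginning(ssc: StreamingContext) = {
  val kafkaParams = Map(
    "zookeeper.connect" -> "zk-host:2181",    // made-up ZooKeeper quorum
    "group.id"          -> "replay-group-1",  // fresh group => no offsets in ZK
    "auto.offset.reset" -> "smallest")        // start from the earliest offset

  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
}

Of course this replays everything the broker still retains, so it only helps
if the processing is idempotent; picking an exact resume point would mean
tracking the processed offsets yourself.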

Massimiliano Tomassi
