spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Evo Eftimov <>
Subject RE: Spark Streaming and reducing latency
Date Mon, 18 May 2015 18:28:37 GMT
My pleasure young man, i will even go beynd the so called "heads up" and send you a solution
design for Feedback Loop preventing spark streaming app clogging and resource depletion and
featuring machine learning based self-tunning AND which is not zookeeper based and hence offers
lower latency

Ps: ultimately though remember that none of this stuff is part of spark streming as of yet

Sent from Samsung Mobile

<div>-------- Original message --------</div><div>From: Akhil Das <>
</div><div>Date:2015/05/18  16:56  (GMT+00:00) </div><div>To: Evo
Eftimov <> </div><div>Cc: </div><div>Subject:
RE: Spark Streaming and reducing latency </div><div>
</div>Thanks for the heads up mate.

On 18 May 2015 19:08, "Evo Eftimov" <> wrote:
Ooow – that is essentially the custom feedback loop mentioned in my previous emails in generic
Architecture Terms and what you have done is only one of the possible implementations moreover
based on Zookeeper – there are other possible designs not using things like zookeeper at
all and hence achieving much lower latency and responsiveness


Can I also give you a friendly advice – there is a looooong way FROM “we=Sigmoid and our
custom sigmoid solution”, TO your earlier statement that Spark Streaming does “NOT”
crash UNCEREMNOUSLY – please maintain responsible and objective communication and facts


From: Akhil Das [] 
Sent: Monday, May 18, 2015 2:28 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg;
Subject: Re: Spark Streaming and reducing latency


we = Sigmoid


back-pressuring mechanism = Stoping the receiver from receiving more messages when its about
to exhaust the worker memory. Here's a similar kind of proposal if you haven't seen already.




Best Regards


On Mon, May 18, 2015 at 6:53 PM, Evo Eftimov <> wrote:

Who are “we” and what is the mysterious “back-pressuring mechanism” and is it part
of the Spark Distribution (are you talking about implementation of the custom feedback loop
mentioned in my previous emails below)- asking these because I can assure you that at least
as of Spark Streaming 1.2.0, as Evo says Spark Streaming DOES crash in “unceremonious way”
when the free RAM available for In Memory Cashed RDDs gets exhausted


From: Akhil Das [] 
Sent: Monday, May 18, 2015 2:03 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg;

Subject: Re: Spark Streaming and reducing latency


We fix the receivers rate at which it should consume at any given point of time. Also we have
a back-pressuring mechanism attached to the receivers so it won't simply crashes in the "unceremonious
way" like Evo said. Mesos has some sort of auto-scaling (read it somewhere), may be you can
look into that also.


Best Regards


On Mon, May 18, 2015 at 5:20 PM, Evo Eftimov <> wrote:

And if you want to genuinely “reduce the latency” (still within the boundaries of the
micro-batch) THEN you need to design and finely tune the Parallel Programming / Execution
Model of your application. The objective/metric here is:


a)      Consume all data within your selected micro-batch window WITHOUT any artificial
message rate limits

b)      The above will result in a certain size of Dstream RDD per micro-batch.

c)       The objective now is to Process that RDD WITHIN the time of the micro-batch
(and also account for temporary message rate spike etc which may further increase the size
of the RDD) – this will avoid any clogging up of the app and will process your messages
at the lowest latency possible in a micro-batch architecture

d)      You achieve the objective stated in c by designing, varying and experimenting
with various aspects of the Spark Streaming Parallel Programming and Execution Model – e.g.
number of receivers, number of threads per receiver, number of executors, number of cores,
RAM allocated to executors, number of RDD partitions which correspond to the number of parallel
threads operating on the RDD etc etc  


Re the “unceremonious removal of DStream RDDs” from RAM by Spark Streaming when the available
RAM is exhausted due to high message rate and which crashes your (hence clogged up) application
the name of the condition is:


Loss was due to java.lang.Exception  

java.lang.Exception: Could not compute split, block
input-4-1410542878200 not found


From: Evo Eftimov [] 
Sent: Monday, May 18, 2015 12:13 PM
To: 'Dmitry Goldenberg'; 'Akhil Das'
Cc: ''
Subject: RE: Spark Streaming and reducing latency


You can use



not set

Maximum rate (number of records per second) at which each receiver will receive data. Effectively,
each stream will consume at most this number of records per second. Setting this configuration
to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark
Streaming programing guide for mode details.



Another way is to implement a feedback loop in your receivers monitoring the performance metrics
of your application/job and based on that adjusting automatically the receiving rate – BUT
all these have nothing to do  with “reducing the latency” – they simply prevent your
application/job from clogging up – the nastier effect of which is when S[ark Streaming starts
removing In Memory RDDs from RAM before they are processed by the job – that works fine
in Spark Batch (ie removing RDDs from RAM based on LRU) but in Spark Streaming when done in
this “unceremonious way” it simply Crashes the application


From: Dmitry Goldenberg [] 
Sent: Monday, May 18, 2015 11:46 AM
To: Akhil Das
Subject: Re: Spark Streaming and reducing latency


Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug
in some cluster auto-scaling solution to make this elastic?


Does Spark have any hooks for instrumenting auto-scaling?

In other words, how do you avoid overwheling the receivers in a scenario when your system's
input can be unpredictable, based on users' activity?

On May 17, 2015, at 11:04 AM, Akhil Das <> wrote:

With receiver based streaming, you can actually specify spark.streaming.blockInterval which
is the interval at which the receiver will fetch data from the source. Default value is 200ms
and hence if your batch duration is 1 second, it will produce 5 blocks of data. And yes, with
sparkstreaming when your processing time goes beyond your batch duration and you are having
a higher data consumption then you will overwhelm the receiver's memory and hence will throw
up block not found exceptions. 


Best Regards


On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <> wrote:

I keep hearing the argument that the way Discretized Streams work with Spark
Streaming is a lot more of a batch processing algorithm than true streaming.
For streaming, one would expect a new item, e.g. in a Kafka topic, to be
available to the streaming consumer immediately.

With the discretized streams, streaming is done with batch intervals i.e.
the consumer has to wait the interval to be able to get at the new items. If
one wants to reduce latency it seems the only way to do this would be by
reducing the batch interval window. However, that may lead to a great deal
of churn, with many requests going into Kafka out of the consumers,
potentially with no results whatsoever as there's nothing new in the topic
at the moment.

Is there a counter-argument to this reasoning? What are some of the general
approaches to reduce latency  folks might recommend? Or, perhaps there are
ways of dealing with this at the streaming API level?

If latency is of great concern, is it better to look into streaming from
something like Flume where data is pushed to consumers rather than pulled by
them? Are there techniques, in that case, to ensure the consumers don't get
overwhelmed with new data?

View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:



View raw message