spark-user mailing list archives

From "Evo Eftimov" <>
Subject RE: Spark Streaming and reducing latency
Date Sun, 17 May 2015 16:55:57 GMT
This is the nature of Spark Streaming as a System Architecture:


1. It is a batch-processing system architecture (Spark Batch) optimized for streaming.

2. In terms of sources of latency in such a system architecture, bear in mind that besides
“batching” there is also the central “Driver” function/module, which is essentially
a central job/task manager (i.e. it runs on a dedicated node, which does not sit on the path
of the messages). Even in a streaming-data scenario, FOR EACH streaming batch it schedules
tasks (as per the DAG for the streaming job), sends them to the workers, receives the results,
then schedules and sends more tasks (as per the DAG for the job), and so on.


In terms of Parallel Programming Patterns/Architecture, the above is known as Data Parallel
Architecture with Central Job/Task Manager.
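The driver-scheduled micro-batch pattern described above can be sketched as a toy simulation (this is illustrative Python, not Spark code; all names are made up for the example):

```python
# Toy sketch of the "central job/task manager" pattern: a driver slices the
# stream into batches, schedules one task per record group, collects the
# results, then schedules the next stage -- it never touches the data path
# except through task results.
from concurrent.futures import ThreadPoolExecutor

def run_micro_batches(stream, batch_size, num_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        for start in range(0, len(stream), batch_size):
            batch = stream[start:start + batch_size]
            # Stage 1: the driver dispatches a map task per record to workers
            mapped = list(workers.map(lambda x: x * x, batch))
            # Stage 2: the driver receives results and runs the next stage
            results.append(sum(mapped))
    return results

print(run_micro_batches(list(range(10)), batch_size=5))
# one sum-of-squares result per 5-record batch: [30, 255]
```

The point of the sketch is that no record moves until the driver has formed a batch and scheduled tasks for it, which is one of the latency sources mentioned above.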


There are other alternatives for achieving lower latency; in terms of parallel programming
patterns they are known as Pipelines, or the Task Parallel Architecture – essentially, every
message streams individually through an assembly line of tasks, and the tasks can run on
multiple cores of one box or in a distributed environment. Storm, for example, implements
this pattern, or you can put together your own solution.
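The assembly-line idea can be sketched in a few lines (again a toy illustration with made-up names, using threads and queues in place of distributed workers):

```python
# Toy sketch of the pipeline (task-parallel) pattern: each message flows
# individually through a chain of stages connected by queues, with no
# batch boundary -- a message enters stage 1 as soon as it arrives.
import queue
import threading

SENTINEL = object()  # end-of-stream marker

def stage(fn, inbox, outbox):
    """Run fn on each message from inbox, forwarding results to outbox."""
    while True:
        msg = inbox.get()
        if msg is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(fn(msg))

def run_pipeline(messages, fns):
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for m in messages:
        queues[0].put(m)       # each message enters the line immediately
    queues[0].put(SENTINEL)
    out = []
    while True:
        r = queues[-1].get()
        if r is SENTINEL:
            break
        out.append(r)
    for t in threads:
        t.join()
    return out

print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))
# -> [20, 30, 40]
```

Because every stage works on one message at a time, per-message latency is just the sum of the stage times, with no batching delay in front of it.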


From: Akhil Das [] 
Sent: Sunday, May 17, 2015 4:04 PM
To: dgoldenberg
Subject: Re: Spark Streaming and reducing latency


With receiver-based streaming, you can actually specify spark.streaming.blockInterval, which
is the interval at which the receiver cuts the data it fetches from the source into blocks.
The default value is 200ms, so if your batch duration is 1 second, it will produce 5 blocks
of data per batch. And yes, with Spark Streaming, when your processing time goes beyond your
batch duration and you have a high rate of data consumption, you will overwhelm the receiver's
memory, and it will throw block-not-found exceptions.
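The arithmetic in the paragraph above is just the batch duration divided by the block interval (the default of 200ms comes from the Spark configuration; the variable names below are only for the calculation):

```python
# With the default spark.streaming.blockInterval of 200 ms and a batch
# duration of 1 second, each batch is cut into
# batch_duration / block_interval blocks, and each block typically
# becomes one partition (and hence one task) of the batch's RDD.
batch_duration_ms = 1000   # batch interval of 1 second
block_interval_ms = 200    # default spark.streaming.blockInterval
blocks_per_batch = batch_duration_ms // block_interval_ms
print(blocks_per_batch)    # -> 5
```

Lowering the block interval raises the number of partitions per batch, which can help parallelism but adds scheduling overhead per batch.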


Best Regards


On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <> wrote:

I keep hearing the argument that the way Discretized Streams work with Spark
Streaming is a lot more of a batch processing algorithm than true streaming.
For streaming, one would expect a new item, e.g. in a Kafka topic, to be
available to the streaming consumer immediately.

With discretized streams, streaming is done with batch intervals, i.e.
the consumer has to wait for the interval to elapse to be able to get at the new items. If
one wants to reduce latency, it seems the only way to do this would be by
reducing the batch interval window. However, that may lead to a great deal
of churn, with many requests going from the consumers into Kafka,
potentially with no results whatsoever as there's nothing new in the topic
at the moment.
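The tradeoff described above can be made concrete with a back-of-the-envelope calculation (illustrative only; the functions are made up for the example): an item arriving at a random point in the interval waits on average half the batch interval before it is even picked up, while shrinking the interval multiplies the number of (possibly empty) fetches against the source.

```python
# Back-of-the-envelope view of the batch-interval tradeoff: smaller
# intervals mean less average pickup latency but more fetch requests.
def avg_pickup_latency_ms(batch_interval_ms):
    # an item arrives uniformly at random within the interval,
    # so it waits half an interval on average before being batched
    return batch_interval_ms / 2.0

def fetches_per_minute(batch_interval_ms):
    # one fetch against the source per batch interval
    return 60_000 / batch_interval_ms

for interval in (1000, 100, 10):
    print(interval, avg_pickup_latency_ms(interval), fetches_per_minute(interval))
```

Going from a 1-second to a 10ms interval cuts average pickup latency from 500ms to 5ms but takes the fetch rate from 60 to 6,000 requests per minute, which is the churn concern raised above.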

Is there a counter-argument to this reasoning? What are some general
approaches to reducing latency that folks might recommend? Or, perhaps there are
ways of dealing with this at the streaming API level?

If latency is of great concern, is it better to look into streaming from
something like Flume where data is pushed to consumers rather than pulled by
them? Are there techniques, in that case, to ensure the consumers don't get
overwhelmed with new data?


