spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject RE: Spark Streaming and reducing latency
Date Mon, 18 May 2015 15:56:53 GMT
Thanks for the heads up mate.
On 18 May 2015 19:08, "Evo Eftimov" <evo.eftimov@isecc.com> wrote:

> Ooow, that is essentially the custom feedback loop mentioned in my
> previous emails in generic architecture terms. What you have done is
> only one of the possible implementations, and one based on ZooKeeper.
> There are other possible designs that do not use things like ZooKeeper at
> all and hence achieve much lower latency and better responsiveness.
>
>
>
> Can I also give you some friendly advice: there is a long way FROM
> “we = Sigmoid and our custom Sigmoid solution” TO your earlier statement
> that Spark Streaming does “NOT” crash UNCEREMONIOUSLY. Please keep your
> communication responsible, objective and factual.
>
>
>
> *From:* Akhil Das [mailto:akhil@sigmoidanalytics.com]
> *Sent:* Monday, May 18, 2015 2:28 PM
> *To:* Evo Eftimov
> *Cc:* Dmitry Goldenberg; user@spark.apache.org
> *Subject:* Re: Spark Streaming and reducing latency
>
>
>
> we = Sigmoid
>
>
>
> back-pressuring mechanism = Stopping the receiver from receiving more
> messages when it's about to exhaust the worker memory. Here's a similar
> kind of proposal <https://issues.apache.org/jira/browse/SPARK-7398> if
> you haven't seen it already.
>
>
>
>
>
>
> Thanks
>
> Best Regards
>
>
>
> On Mon, May 18, 2015 at 6:53 PM, Evo Eftimov <evo.eftimov@isecc.com>
> wrote:
>
> Who are “we”, what is the mysterious “back-pressuring mechanism”, and is
> it part of the Spark distribution (are you talking about an implementation
> of the custom feedback loop mentioned in my previous emails below)? I am
> asking because I can assure you that, at least as of Spark Streaming 1.2.0,
> as Evo says, Spark Streaming DOES crash in an “unceremonious way” when the
> free RAM available for in-memory cached RDDs gets exhausted.
>
>
>
> *From:* Akhil Das [mailto:akhil@sigmoidanalytics.com]
> *Sent:* Monday, May 18, 2015 2:03 PM
> *To:* Evo Eftimov
> *Cc:* Dmitry Goldenberg; user@spark.apache.org
>
>
> *Subject:* Re: Spark Streaming and reducing latency
>
>
>
> We fix the rate at which the receivers consume at any given point in
> time. We also have a back-pressuring mechanism attached to the receivers so
> it won't simply crash in the "unceremonious way" like Evo said. Mesos has
> some sort of auto-scaling (I read about it somewhere); maybe you can look
> into that as well.
>
>
> Thanks
>
> Best Regards
>
>
>
> On Mon, May 18, 2015 at 5:20 PM, Evo Eftimov <evo.eftimov@isecc.com>
> wrote:
>
> And if you want to genuinely “reduce the latency” (still within the
> boundaries of the micro-batch) THEN you need to design and finely tune the
> Parallel Programming / Execution Model of your application. The
> objective/metric here is:
>
> a) Consume all data within your selected micro-batch window WITHOUT
> any artificial message rate limits.
>
> b) The above will result in a certain size of DStream RDD per
> micro-batch.
>
> c) The objective now is to process that RDD WITHIN the time of the
> micro-batch (and also to account for temporary message rate spikes etc.,
> which may further increase the size of the RDD). This avoids any clogging
> up of the app and processes your messages at the lowest latency possible
> in a micro-batch architecture.
>
> d) You achieve the objective stated in c) by designing, varying and
> experimenting with various aspects of the Spark Streaming Parallel
> Programming and Execution Model, e.g. number of receivers, number of
> threads per receiver, number of executors, number of cores, RAM allocated
> to executors, number of RDD partitions (which corresponds to the number of
> parallel threads operating on the RDD) etc.
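
Point c) above can be illustrated with a small queueing sketch in plain Python (no Spark required). It is only a model, not Spark code: the function name and the sample processing times are made up for illustration. It shows how the wait before each batch starts stays at zero while processing time is below the batch interval, and grows without bound once it is above.

```python
def scheduling_delays(batch_interval_s, processing_times_s):
    """Return how long each batch waits before its processing can start.

    A new batch is ready every batch_interval_s seconds. If the previous
    batch is still being processed, the new one has to queue behind it, so
    the wait evolves as: wait = max(0, wait + processing_time - interval).
    """
    delays = []
    wait = 0.0
    for processing_time in processing_times_s:
        delays.append(wait)
        wait = max(0.0, wait + processing_time - batch_interval_s)
    return delays


# Hypothetical numbers: 1 s batches, processed in under 1 s vs. in 1.5 s.
keeping_up = scheduling_delays(1.0, [0.5, 0.75, 0.5, 0.75])
falling_behind = scheduling_delays(1.0, [1.5, 1.5, 1.5, 1.5])

print(keeping_up)      # [0.0, 0.0, 0.0, 0.0]
print(falling_behind)  # [0.0, 0.5, 1.0, 1.5]
```

In the second case every batch adds another half second of backlog, which is the "clogging up" described above.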
>
>
>
> Re the “unceremonious removal of DStream RDDs” from RAM by Spark Streaming
> when the available RAM is exhausted due to a high message rate, which
> crashes your (hence clogged up) application: the name of the condition is:
>
>
>
> Loss was due to java.lang.Exception
>
> java.lang.Exception: *Could not compute split, block*
> *input-4-1410542878200 not found*
>
>
>
> *From:* Evo Eftimov [mailto:evo.eftimov@isecc.com]
> *Sent:* Monday, May 18, 2015 12:13 PM
> *To:* 'Dmitry Goldenberg'; 'Akhil Das'
> *Cc:* 'user@spark.apache.org'
> *Subject:* RE: Spark Streaming and reducing latency
>
>
>
> You can use spark.streaming.receiver.maxRate (default: not set):
>
> Maximum rate (number of records per second) at which each receiver will
> receive data. Effectively, each stream will consume at most this number of
> records per second. Setting this configuration to 0 or a negative number
> will put no limit on the rate. See the deployment guide
> <https://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications>
> in the Spark Streaming programming guide for more details.
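
As a rough illustration of what such a cap implies, the plain-Python sketch below computes the upper bound on records per micro-batch. Only the property name spark.streaming.receiver.maxRate comes from the text above; the helper name, the dict standing in for a Spark configuration, and the numbers are assumptions for the example.

```python
def max_records_per_batch(max_rate_per_receiver, num_receivers, batch_interval_s):
    """Upper bound on records in one micro-batch under a per-receiver cap.

    A cap of 0 or a negative number means "no limit", so there is no bound
    and None is returned.
    """
    if max_rate_per_receiver <= 0:
        return None
    return max_rate_per_receiver * num_receivers * batch_interval_s


# Hypothetical settings: a plain dict standing in for the Spark configuration.
spark_conf = {"spark.streaming.receiver.maxRate": "1000"}
cap = int(spark_conf["spark.streaming.receiver.maxRate"])

print(max_records_per_batch(cap, num_receivers=2, batch_interval_s=2))  # 4000
print(max_records_per_batch(0, num_receivers=2, batch_interval_s=2))    # None
```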
>
>
>
>
>
> Another way is to implement a feedback loop in your receivers that monitors
> the performance metrics of your application/job and automatically adjusts
> the receiving rate based on them. BUT all of these have nothing to do with
> “reducing the latency”; they simply prevent your application/job from
> clogging up. The nastier effect of clogging up is when Spark Streaming
> starts removing in-memory RDDs from RAM before they are processed by the
> job. That works fine in Spark Batch (i.e. removing RDDs from RAM based on
> LRU), but in Spark Streaming, when done in this “unceremonious way”, it
> simply crashes the application.
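
One possible shape for such a feedback loop is sketched below in plain Python. It is not a Spark API and not the design discussed in this thread: the function, its parameters and the 0.9 headroom factor are all hypothetical, and it assumes processing time grows roughly linearly with the receiving rate.

```python
def next_receive_rate(current_rate, processing_time_s, batch_interval_s,
                      headroom=0.9, min_rate=1.0, max_rate=100_000.0):
    """Propose a new receiving rate (records/s) from the last batch's metrics.

    Assumes processing time is roughly proportional to the receiving rate,
    and scales the rate so the next batch should take about
    headroom * batch_interval_s to process. The result is clamped to
    [min_rate, max_rate].
    """
    if processing_time_s <= 0:
        return current_rate  # no usable measurement, keep the current rate
    target_time_s = headroom * batch_interval_s
    proposed = current_rate * target_time_s / processing_time_s
    return max(min_rate, min(max_rate, proposed))


# Hypothetical measurements for a 1 s batch interval.
slowed = next_receive_rate(1000.0, processing_time_s=2.0, batch_interval_s=1.0)
raised = next_receive_rate(1000.0, processing_time_s=0.45, batch_interval_s=1.0)
print(slowed, raised)
```

A receiver would call something like this once per completed batch and throttle itself to the returned rate; as noted above, this protects the job from clogging up but does not by itself lower latency.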
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenberg123@gmail.com]
> *Sent:* Monday, May 18, 2015 11:46 AM
> *To:* Akhil Das
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Streaming and reducing latency
>
>
>
> Thanks, Akhil. So what do folks typically do to increase/contract the
> capacity? Do you plug in some cluster auto-scaling solution to make this
> elastic?
>
>
>
> Does Spark have any hooks for instrumenting auto-scaling?
>
> In other words, how do you avoid overwhelming the receivers in a scenario
> where your system's input can be unpredictable, based on users' activity?
>
>
> On May 17, 2015, at 11:04 AM, Akhil Das <akhil@sigmoidanalytics.com>
> wrote:
>
> With receiver-based streaming, you can actually specify
> spark.streaming.blockInterval, which is the interval at which the data
> received by a receiver is chunked into blocks. The default value is 200ms,
> so if your batch duration is 1 second, each receiver will produce 5 blocks
> of data per batch. And yes, with Spark Streaming, when your processing time
> goes beyond your batch duration and you have a high rate of data
> consumption, you will overwhelm the receiver's memory, which will then
> throw block-not-found exceptions.
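
The block arithmetic above can be checked with a few lines of plain Python. The helper name and the second set of numbers are made up for illustration; the 200 ms default and the 1 second batch come from the message.

```python
def blocks_per_batch(batch_interval_ms, block_interval_ms=200, num_receivers=1):
    """Number of blocks generated per micro-batch across all receivers.

    Each receiver cuts its incoming data into one block per block interval,
    so a batch contains (batch interval / block interval) blocks per receiver.
    """
    return (batch_interval_ms // block_interval_ms) * num_receivers


print(blocks_per_batch(1000))                    # 5
print(blocks_per_batch(2000, num_receivers=3))   # 30
```

As far as I understand, each block ends up as one partition of the batch's RDD, so this number also bounds how many tasks can work on a batch in parallel.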
>
>
> Thanks
>
> Best Regards
>
>
>
> On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <dgoldenberg123@gmail.com>
> wrote:
>
> I keep hearing the argument that the way Discretized Streams work with
> Spark Streaming is a lot more of a batch processing algorithm than true
> streaming. For streaming, one would expect a new item, e.g. in a Kafka
> topic, to be available to the streaming consumer immediately.
>
> With the discretized streams, streaming is done with batch intervals, i.e.
> the consumer has to wait the interval to be able to get at the new items.
> If one wants to reduce latency it seems the only way to do this would be by
> reducing the batch interval window. However, that may lead to a great deal
> of churn, with many requests going into Kafka out of the consumers,
> potentially with no results whatsoever as there's nothing new in the topic
> at the moment.
>
> Is there a counter-argument to this reasoning? What are some of the general
> approaches to reduce latency folks might recommend? Or, perhaps there are
> ways of dealing with this at the streaming API level?
>
> If latency is of great concern, is it better to look into streaming from
> something like Flume where data is pushed to consumers rather than pulled
> by them? Are there techniques, in that case, to ensure the consumers don't
> get overwhelmed with new data?
>
>
>
