spark-user mailing list archives

From Tathagata Das <>
Subject Re: Are Spark Streaming RDDs always processed in order?
Date Mon, 06 Jul 2015 19:31:23 GMT
Yes, the RDD of batch t+1 will be processed only after the RDD of batch t
has been processed, unless there are errors where the batch completely
fails to get processed, in which case the point is moot. Just reinforcing
the concept.

Additional information: This is true in the default configuration. You may
find references to an undocumented hidden configuration called
"spark.streaming.concurrentJobs" elsewhere in the mailing list. Setting
that to more than 1 to get more concurrency (between output ops) *breaks*
the above guarantee.
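
To illustrate why this ordering guarantee matters for the bulk-ack approach asked about in this thread, here is a minimal sketch in plain Python. It uses neither Spark nor a RabbitMQ client: `Broker`, `ack_multiple`, and `run` are hypothetical names made up for this illustration, and the only assumption carried over from the question is that acking the last event of a batch acknowledges everything before it.

```python
from dataclasses import dataclass


@dataclass
class Broker:
    """Toy stand-in for a broker with cumulative acks (hypothetical, not a real client)."""
    acked_up_to: int = 0  # highest delivery tag acknowledged so far

    def ack_multiple(self, delivery_tag: int) -> None:
        # A cumulative ack covers every tag <= delivery_tag.
        self.acked_up_to = max(self.acked_up_to, delivery_tag)


def run(batches, completion_order):
    """Finish batches in completion_order, bulk-acking each batch's last tag.

    Returns the set of tags that were acknowledged before being processed.
    """
    broker = Broker()
    processed = set()
    acked_too_early = set()
    for batch_index in completion_order:
        tags = batches[batch_index]
        processed.update(tags)         # the batch has finished processing
        broker.ack_multiple(tags[-1])  # ack only the last event of the batch
        for tag in range(1, broker.acked_up_to + 1):
            if tag not in processed:
                acked_too_early.add(tag)
    return acked_too_early


batches = [[1, 2, 3], [4, 5, 6]]

# Serial, in-order completion (the default behaviour described above).
in_order = run(batches, completion_order=[0, 1])

# Batch 2 finishing before batch 1, as could happen once ordering is not guaranteed.
out_of_order = run(batches, completion_order=[1, 0])

print(in_order)      # no event is acked before it is processed
print(out_of_order)  # the events of the first batch are acked too early
```

With in-order completion no event is acknowledged early. When the second batch finishes first, its bulk ack also covers the still-unprocessed events of the first batch, which is the hazard raised in the original question.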


On Sat, Jul 4, 2015 at 6:53 AM, Michal Čizmazia <> wrote:

> I had a similar inquiry, copied below.
> I was also looking into making an SQS Receiver reliable:
> Hope this helps.
> ---------- Forwarded message ----------
> From: Tathagata Das <>
> Date: 20 June 2015 at 17:21
> Subject: Re: Serial batching with Spark Streaming
> To: Michal Čizmazia <>
> Cc: Binh Nguyen Van <>, user <>
> No, it does not. By default, batch X+1 will be started only after all the
> retries etc. related to batch X are done.
> Yes, one RDD per batch per DStream. However, the RDD could be a union of
> multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned
> DStream).
> TD
> On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia <>
> wrote:
> Thanks Tathagata!
> I will use *foreachRDD*/*foreachPartition*() instead of *transform*() then.
> Does the default scheduler initiate the execution of *batch X+1*
> after *batch X* even if tasks for *batch X* need to be *retried
> due to failures*? If not, could you please suggest workarounds and point
> me to the code?
> One more thing was not 100% clear to me from the documentation: Is there
> exactly *1 RDD* published *per batch interval* in a DStream?
> On 3 July 2015 at 22:12, khaledh <> wrote:
>> I'm writing a Spark Streaming application that uses RabbitMQ to consume
>> events. One feature of RabbitMQ that I intend to make use of is bulk ack
>> of messages, i.e. no need to ack one-by-one, but only ack the last event
>> in a batch and that would ack the entire batch.
>> Before I commit to doing so, I'd like to know if Spark Streaming always
>> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
>> before RDD2, is it true that RDD2 will never be scheduled/processed
>> before RDD1 is finished?
>> This is crucial to the ack logic, since if RDD2 can potentially be
>> processed while RDD1 is still being processed, then if I ack the last
>> event in RDD2 that would also ack all events in RDD1, even though they
>> may not have been completely processed yet.
