spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dibyendu Bhattacharya <>
Subject Re: Low Level Kafka Consumer for Spark
Date Wed, 03 Sep 2014 17:38:55 GMT

Sorry for little delay . As discussed in this thread, I have modified the
Kafka-Spark-Consumer (
code to have dedicated Receiver for every Topic Partition. You can see the
example howto create Union of these receivers
in .

Thanks to Chris for suggesting this change.


On Mon, Sep 1, 2014 at 2:55 AM, RodrigoB <> wrote:

> Just a comment on the recovery part.
> Is it correct to say that currently Spark Streaming recovery design does
> not
> consider re-computations (upon metadata lineage recovery) that depend on
> blocks of data of the received stream?
> Just to illustrate a real use case (mine):
> - We have object states which have a Duration field per state which is
> incremented on every batch interval. Also this object state is reset to 0
> upon incoming state changing events. Let's supposed there is at least one
> event since the last data checkpoint. This will lead to inconsistency upon
> driver recovery: The Duration field will get incremented from the data
> checkpoint version until the recovery moment, but the state change event
> will never be in the end we have the old state with the
> wrong Duration value.
> To make things worst, let's imagine we're dumping the Duration increases
> somewhere...which means we're spreading the problem across our system.
> Re-computation awareness is something I've commented on another thread and
> rather treat it separately.
> Re-computations do occur, but the only RDD's that are recovered are the
> ones
> from the data checkpoint. This is what we've seen. Is not enough by itself
> to ensure recovery of computed data and this partial recovery leads to
> inconsistency in some cases.
> Roger - I share the same question with you - I'm just not sure if the
> replicated data really gets persisted on every batch. The execution lineage
> is checkpointed, but if we have big chunks of data being consumed to
> Receiver node on let's say a second bases then having it persisted to HDFS
> every second could be a big challenge for keeping JVM performance - maybe
> that could be reason why it's not really implemented...assuming it isn't.
> Dibyendu had a great effort with the offset controlling code but the
> general
> state consistent recovery feels to me like another big issue to address.
> I plan on having a dive into the Streaming code and try to at least
> contribute with some ideas. Some more insight from anyone on the dev team
> will be very appreciated.
> tnks,
> Rod
> --
> View this message in context:
> Sent from the Apache Spark User List mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

View raw message