spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com>
Subject Re: Low Level Kafka Consumer for Spark
Date Wed, 03 Sep 2014 17:38:55 GMT
Hi,

Sorry for little delay . As discussed in this thread, I have modified the
Kafka-Spark-Consumer ( https://github.com/dibbhatt/kafka-spark-consumer)
code to have dedicated Receiver for every Topic Partition. You can see the
example howto create Union of these receivers
in consumer.kafka.client.Consumer.java .

Thanks to Chris for suggesting this change.

Regards,
Dibyendu


On Mon, Sep 1, 2014 at 2:55 AM, RodrigoB <rodrigo.boavida@aspect.com> wrote:

> Just a comment on the recovery part.
>
> Is it correct to say that currently Spark Streaming recovery design does
> not
> consider re-computations (upon metadata lineage recovery) that depend on
> blocks of data of the received stream?
> https://issues.apache.org/jira/browse/SPARK-1647
>
> Just to illustrate a real use case (mine):
> - We have object states which have a Duration field per state which is
> incremented on every batch interval. Also this object state is reset to 0
> upon incoming state changing events. Let's supposed there is at least one
> event since the last data checkpoint. This will lead to inconsistency upon
> driver recovery: The Duration field will get incremented from the data
> checkpoint version until the recovery moment, but the state change event
> will never be re-processed...so in the end we have the old state with the
> wrong Duration value.
> To make things worst, let's imagine we're dumping the Duration increases
> somewhere...which means we're spreading the problem across our system.
> Re-computation awareness is something I've commented on another thread and
> rather treat it separately.
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-td12568.html#a13205
>
> Re-computations do occur, but the only RDD's that are recovered are the
> ones
> from the data checkpoint. This is what we've seen. Is not enough by itself
> to ensure recovery of computed data and this partial recovery leads to
> inconsistency in some cases.
>
> Roger - I share the same question with you - I'm just not sure if the
> replicated data really gets persisted on every batch. The execution lineage
> is checkpointed, but if we have big chunks of data being consumed to
> Receiver node on let's say a second bases then having it persisted to HDFS
> every second could be a big challenge for keeping JVM performance - maybe
> that could be reason why it's not really implemented...assuming it isn't.
>
> Dibyendu had a great effort with the offset controlling code but the
> general
> state consistent recovery feels to me like another big issue to address.
>
> I plan on having a dive into the Streaming code and try to at least
> contribute with some ideas. Some more insight from anyone on the dev team
> will be very appreciated.
>
> tnks,
> Rod
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p13208.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Mime
View raw message