spark-user mailing list archives

From Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com>
Subject Re: Low Level Kafka Consumer for Spark
Date Wed, 03 Dec 2014 10:13:46 GMT
Hi,

Yes, as Jerry mentioned, SPARK-3129 (
https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature,
which solves the driver failure problem. The way 3129 is designed, it
solves the driver failure problem regardless of the source of the stream
(Kafka, Flume, etc.). But with 3129 alone you cannot achieve a complete
solution for data loss. You also need a reliable receiver, which solves
the data loss issue on receiver failure.
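
To make the "reliable receiver" idea concrete, here is a minimal toy sketch in
plain Python. It is not Spark code: ToySource, receive and the list used as a
WAL are hypothetical names made up for illustration. The only point it shows is
the ordering: write durably first, acknowledge the source afterwards, so a
receiver crash causes redelivery rather than loss.

```python
class ToySource:
    """Hypothetical source that redelivers every message not yet acknowledged."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.acked = 0  # number of messages acknowledged so far

    def pending(self):
        # Everything after the last acknowledged message is delivered again.
        return self.messages[self.acked:]

    def ack(self, count):
        self.acked += count


def receive(source, wal, crash_after=None):
    """Store each pending message in the WAL, then acknowledge it.

    crash_after simulates a receiver failure: after that many writes the
    receiver stops before acknowledging the last written message.
    """
    for i, msg in enumerate(source.pending()):
        wal.append(msg)          # 1. durable write first (a list stands in for the WAL)
        if crash_after is not None and i + 1 == crash_after:
            return               # crash: the write happened, the ack did not
        source.ack(1)            # 2. acknowledge only after the write


source = ToySource(["a", "b", "c"])
wal = []
receive(source, wal, crash_after=2)  # receiver dies after writing "b"
receive(source, wal)                 # restarted receiver picks up unacked messages
```

In this toy run "b" is written twice, because it was stored but never
acknowledged before the crash. That is the at-least-once trade-off: duplicates
are possible, but nothing is lost.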

The Low Level Consumer (https://github.com/dibbhatt/kafka-spark-consumer),
for which this email thread was started, has solved that problem with the
Kafka Low Level API.
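
As a rough illustration of why managing offsets yourself helps with recovery,
here is a toy sketch in plain Python. It is not the kafka-spark-consumer code
and does not use any Kafka API: consume, the list standing in for a partition
log, and the dict standing in for a durable offset store are all assumptions
made for the example. The offset is committed only after a batch has been
handed over, so a failure before the commit means the same batch is read again.

```python
def consume(log, offsets, partition, batch_size, fail=False):
    """Read one batch starting at the committed offset of the partition.

    The offset is advanced only after the batch is returned successfully.
    With fail=True the function simulates a crash before the commit, so the
    committed offset stays where it was.
    """
    start = offsets.get(partition, 0)
    batch = log[start:start + batch_size]
    if fail:
        return []                            # simulated crash, nothing committed
    offsets[partition] = start + len(batch)  # commit after a successful hand-over
    return batch


log = ["m0", "m1", "m2", "m3", "m4"]  # stands in for one partition
offsets = {}                          # stands in for a durable offset store

first = consume(log, offsets, 0, 2)
failed = consume(log, offsets, 0, 2, fail=True)
retry = consume(log, offsets, 0, 2)
```

After the failed attempt the committed offset is still 2, so the retry reads
"m2" and "m3" again instead of skipping them.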

And SPARK-4062, as Jerry mentioned, also recently solved the same problem
using the Kafka High Level API.

On the Kafka High Level Consumer API approach, I would like to mention
that Kafka 0.8 has some issues, as described in this wiki (
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design),
where consumer rebalancing sometimes fails. That is one of the key reasons
Kafka is rewriting the consumer API in Kafka 0.9.

I know a few folks have already faced these rebalancing issues while using
the Kafka High Level API. If you ask my opinion, we at Pearson are still
using the Low Level Consumer, as it seems to be more robust and performant,
and we have been using it for a few months without any issue... and also I
may be a little biased :)

Regards,
Dibyendu



On Wed, Dec 3, 2014 at 7:04 AM, Shao, Saisai <saisai.shao@intel.com> wrote:

> Hi Rod,
>
> The purpose of introducing the WAL mechanism in Spark Streaming as a general
> solution is to let all receivers benefit from it.
>
> Though, as you said, external sources like Kafka have their own checkpoint
> mechanism: instead of storing data in the WAL, we could store only metadata
> in the WAL and recover from the last committed offsets. But this requires a
> sophisticated Kafka receiver design built on the low-level API, and we would
> need to take care of rebalancing and fault tolerance ourselves. So right now,
> instead of implementing a whole new receiver, we chose to implement a simple
> one; though its performance is not so good, it is much easier to understand
> and maintain.
>
> The design purpose and implementation of the reliable Kafka receiver can be
> found in SPARK-4062 (https://issues.apache.org/jira/browse/SPARK-4062).
> Improving the reliable Kafka receiver along the lines you mentioned is on
> our schedule for the future.
>
> Thanks
> Jerry
>
>
> -----Original Message-----
> From: RodrigoB [mailto:rodrigo.boavida@aspect.com]
> Sent: Wednesday, December 3, 2014 5:44 AM
> To: user@spark.incubator.apache.org
> Subject: Re: Low Level Kafka Consumer for Spark
>
> Dibyendu,
>
> Just to make sure I am not misunderstood: my concerns refer to the upcoming
> Spark solution, not yours. I would like to gather the perspective of
> someone who implemented recovery with Kafka in a different way.
>
> Thanks,
> Rod
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p20196.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org For additional
> commands, e-mail: user-help@spark.apache.org
>
