Yes, as Jerry mentioned, SPARK-3129 (https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature, which solves the driver failure problem. The way SPARK-3129 is designed, it solves the driver failure problem regardless of the source of the stream (Kafka, Flume, etc.). But with SPARK-3129 alone you cannot achieve a complete solution for data loss. You also need a reliable receiver, which solves the data loss issue on receiver failure.
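
To illustrate how the two pieces fit together, here is a minimal Python sketch. It is not Spark code: the WriteAheadLog and ReliableReceiver classes are hypothetical stand-ins, and the "ack" is just a list append. The point it shows is the ordering: the receiver acknowledges a message to the source only after the message is durably in the log, so a restart can replay the log without losing acknowledged data.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log file; each record is one JSON line (illustrative only)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Flush and fsync so the record survives a process crash.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Return every record written so far, in write order.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


class ReliableReceiver:
    """Acknowledges a message to the source only after it is in the log."""

    def __init__(self, wal):
        self.wal = wal
        self.acked = []

    def on_message(self, msg_id, payload):
        self.wal.append({"id": msg_id, "payload": payload})
        self.acked.append(msg_id)  # stands in for an ack sent to the source


wal_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")
receiver = ReliableReceiver(WriteAheadLog(wal_path))
receiver.on_message(1, "a")
receiver.on_message(2, "b")

# Simulated restart: a fresh log object on the same path sees both records.
recovered = WriteAheadLog(wal_path).replay()
```

If the receiver dies before the append completes, no ack has been sent, so the source is expected to redeliver. That is the receiver-failure half that the WAL alone does not cover.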

The Low Level Consumer (https://github.com/dibbhatt/kafka-spark-consumer), for which this email thread was started, has solved that problem with the Kafka Low Level API.

And SPARK-4062, as Jerry mentioned, also recently solved the same problem using the Kafka High Level API.

Regarding the Kafka High Level Consumer API approach, I would like to mention that Kafka 0.8 has some issues, described in this wiki (https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design), where consumer rebalancing sometimes fails. That is one of the key reasons Kafka is rewriting the consumer API in Kafka 0.9.

I know a few folks have already faced these rebalancing issues while using the Kafka High Level API. If you ask my opinion, we at Pearson are still using the Low Level Consumer, as it seems to be more robust and performant, and we have been using it for a few months without any issue. I may also be a little biased :)


On Wed, Dec 3, 2014 at 7:04 AM, Shao, Saisai <saisai.shao@intel.com> wrote:
Hi Rod,

The purpose of introducing the WAL mechanism in Spark Streaming as a general solution is to let all receivers benefit from it.

As you said, though, external sources like Kafka have their own checkpoint mechanism, so instead of storing the data in the WAL we could store only metadata there and recover from the last committed offsets. But this requires a sophisticated Kafka receiver design built on the low-level API, and we would also need to handle rebalancing and fault tolerance ourselves. So for now, instead of implementing a whole new receiver, we chose to implement a simple one. Its performance is not as good, but it is much easier to understand and maintain.
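
As a rough illustration of the metadata-only alternative described above, here is a small Python sketch. It does not use Kafka or Spark: the in-memory topic dictionary, the offsets file, and the commit_offsets, load_offsets and fetch helpers are all hypothetical. It only shows the idea of persisting the next offset to read per partition and resuming from that point after a restart, rather than persisting the data itself.

```python
import json
import os
import tempfile


def commit_offsets(path, offsets):
    """Atomically persist {partition: next_offset_to_read}."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(offsets, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_offsets(path):
    """Load committed offsets; an empty dict means start from offset 0."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        # JSON object keys are strings, so convert partitions back to int.
        return {int(p): o for p, o in json.load(f).items()}


def fetch(log, offsets, partition, max_messages):
    """Read up to max_messages from a partition, starting at its committed offset."""
    start = offsets.get(partition, 0)
    return start, log[partition][start:start + max_messages]


# Hypothetical in-memory stand-in for a two-partition topic.
topic = {0: ["a0", "a1", "a2", "a3"], 1: ["b0", "b1"]}
offsets_path = os.path.join(tempfile.mkdtemp(), "offsets.json")

# Process one batch from partition 0, then commit the next offset to read.
start, batch = fetch(topic, load_offsets(offsets_path), 0, 2)
commit_offsets(offsets_path, {0: start + len(batch)})

# Simulated restart: reload the committed offsets and resume from there.
resumed_start, resumed_batch = fetch(topic, load_offsets(offsets_path), 0, 2)
```

Committing only after a batch has been processed gives at-least-once behavior: a crash between processing and commit causes that batch to be re-read. A real receiver would additionally have to track partition leadership and rebalancing, which is the extra complexity mentioned above.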

The design purpose and implementation of the reliable Kafka receiver can be found in SPARK-4062 (https://issues.apache.org/jira/browse/SPARK-4062). Improving the reliable Kafka receiver along the lines you mentioned is on our schedule for the future.


-----Original Message-----
From: RodrigoB [mailto:rodrigo.boavida@aspect.com]
Sent: Wednesday, December 3, 2014 5:44 AM
To: user@spark.incubator.apache.org
Subject: Re: Low Level Kafka Consumer for Spark


Just to make sure I am not misunderstood: my concerns refer to the upcoming Spark solution and not yours. I would like to gather the perspective of someone who implemented recovery with Kafka in a different way.


View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p20196.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscribe@spark.apache.org For additional commands, e-mail: user-help@spark.apache.org