spark-user mailing list archives

From Luis Ángel Vicente Sánchez <>
Subject Re: Low Level Kafka Consumer for Spark
Date Wed, 03 Dec 2014 14:33:11 GMT
My main complaint about the WAL mechanism in the new reliable Kafka receiver
is that you have to enable checkpointing and, for some reason, even if
spark.cleaner.ttl is set to a reasonable value, only the metadata is
cleaned periodically. In my tests, using a folder in my filesystem as the
checkpoint folder, the receivedMetaData folder stays almost constant in
size but the receivedData folder keeps growing; spark.cleaner.ttl
was set to 300 seconds.
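
For anyone who wants to reproduce this observation, here is a minimal sketch of how the two folders could be measured. It is not part of Spark. The function names are made up for illustration, and the subfolder names are simply the ones mentioned above, so adjust them if your checkpoint folder is laid out differently.

```python
import os


def dir_size_bytes(path):
    """Return the total size of all regular files under `path`.

    A missing folder counts as 0 bytes, because os.walk yields nothing for it.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file was removed between listing and stat (for example
                # by a cleaner running concurrently); skip it.
                pass
    return total


def checkpoint_usage(checkpoint_dir, subdirs=("receivedData", "receivedMetaData")):
    """Map each subfolder name to its current size in bytes.

    The default subfolder names are assumptions taken from the message above.
    """
    return {
        name: dir_size_bytes(os.path.join(checkpoint_dir, name))
        for name in subdirs
    }
```

Calling checkpoint_usage on the checkpoint folder at a fixed interval (say once a minute) and logging the result should show whether receivedData keeps growing after spark.cleaner.ttl has elapsed.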

2014-12-03 10:13 GMT+00:00 Dibyendu Bhattacharya <>:

> Hi,
> Yes, as Jerry mentioned, SPARK-3129 enabled the WAL feature, which solves
> the driver failure problem. The way SPARK-3129 is designed, it solves the
> driver failure problem regardless of the source of the stream (Kafka,
> Flume, etc.). But SPARK-3129 alone does not give you a complete solution
> for data loss. You also need a reliable receiver that solves the data
> loss issue on receiver failure.
> The Low Level Consumer, for which this email thread was started, has
> solved that problem with the Kafka Low Level API.
> And SPARK-4062, as Jerry mentioned, also recently solved the same problem
> using the Kafka High Level API.
> On the Kafka High Level Consumer API approach, I would like to mention
> that Kafka 0.8 has some issues, as mentioned in this wiki, where consumer
> re-balance sometimes fails, and that is one of the key reasons Kafka is
> rewriting the consumer API in Kafka 0.9.
> I know a few folks have already faced these re-balancing issues
> while using the Kafka High Level API, and if you ask my opinion, we at Pearson
> are still using the Low Level Consumer, as it seems to be more robust and
> performant, and we have been using it for a few months without any issues
> ... and also I may be a little biased :)
> Regards,
> Dibyendu
> On Wed, Dec 3, 2014 at 7:04 AM, Shao, Saisai <>
> wrote:
>> Hi Rod,
>> The purpose of introducing the WAL mechanism in Spark Streaming as a general
>> solution is to let all receivers benefit from it.
>> As you said, though, external sources like Kafka have their own checkpoint
>> mechanism, so instead of storing data in the WAL we could store only metadata
>> in the WAL and recover from the last committed offsets. But this requires a
>> sophisticated Kafka receiver design built on the low-level API, and we would
>> need to take care of rebalancing and fault tolerance ourselves. So
>> right now, instead of implementing a whole new receiver, we chose to
>> implement a simple one; though the performance is not so good, it's much
>> easier to understand and maintain.
>> The design purpose and implementation of the reliable Kafka receiver can be
>> found in ( Improving the
>> reliable Kafka receiver along the lines you mentioned is on our
>> schedule for the future.
>> Thanks
>> Jerry
>> -----Original Message-----
>> From: RodrigoB []
>> Sent: Wednesday, December 3, 2014 5:44 AM
>> To:
>> Subject: Re: Low Level Kafka Consumer for Spark
>> Dibyendu,
>> Just to make sure I am not misunderstood: my concerns refer
>> to the upcoming Spark solution and not yours. I would like to gather the
>> perspective of someone who implemented recovery with Kafka a different
>> way.
>> Thanks,
>> Rod
