spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com>
Subject Re: Low Level Kafka Consumer for Spark
Date Tue, 05 Aug 2014 15:38:51 GMT
Hi

This fault tolerant aspect already taken care in the Kafka-Spark Consumer
code , like if Leader of a partition changes etc.. in ZkCoordinator.java.
Basically it does a refresh of PartitionManagers every X seconds to make
sure Partition details is correct and consumer don't fail.

Dib


On Tue, Aug 5, 2014 at 8:01 PM, Shao, Saisai <saisai.shao@intel.com> wrote:

> Hi,
>
> I think this is an awesome feature for Spark Streaming Kafka interface to
> offer user the controllability of partition offset, so user can have more
> applications based on this.
>
> What I concern is that if we want to do offset management, fault tolerant
> related control and others, we have to take the role as current
> ZookeeperConsumerConnect did, that would be a big field we should take care
> of, for example when node is failed, how to pass current partition to
> another consumer and some others. I’m not sure what is your thought?
>
> Thanks
> Jerry
>
> From: Dibyendu Bhattacharya [mailto:dibyendu.bhattachary@gmail.com]
> Sent: Tuesday, August 05, 2014 5:15 PM
> To: Jonathan Hodges; dev@spark.apache.org
> Cc: user
> Subject: Re: Low Level Kafka Consumer for Spark
>
> Thanks Jonathan,
>
> Yes, till non-ZK based offset management is available in Kafka, I need to
> maintain the offset in ZK. And yes, both cases explicit commit is
> necessary. I modified the Low Level Kafka Spark Consumer little bit to have
> Receiver spawns threads for every partition of the topic and perform the
> 'store' operation in multiple threads. It would be good if the
> receiver.store methods are made thread safe..which is not now presently .
>
> Waiting for TD's comment on this Kafka Spark Low Level consumer.
>
>
> Regards,
> Dibyendu
>
>
> On Tue, Aug 5, 2014 at 5:32 AM, Jonathan Hodges <hodgesz@gmail.com<mailto:
> hodgesz@gmail.com>> wrote:
> Hi Yan,
>
> That is a good suggestion.  I believe non-Zookeeper offset management will
> be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled for
> September.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management
>
> That should make this fairly easy to implement, but it will still require
> explicit offset commits to avoid data loss which is different than the
> current KafkaUtils implementation.
>
> Jonathan
>
>
>
>
> On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang724@gmail.com<mailto:
> yanfang724@gmail.com>> wrote:
> Another suggestion that may help is that, you can consider use Kafka to
> store the latest offset instead of Zookeeper. There are at least two
> benefits: 1) lower the workload of ZK 2) support replay from certain
> offset. This is how Samza<http://samza.incubator.apache.org/> deals with
> the Kafka offset, the doc is here<
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html>
> . Thank you.
>
> Cheers,
>
> Fang, Yan
> yanfang724@gmail.com<mailto:yanfang724@gmail.com>
> +1 (206) 849-4108<tel:%2B1%20%28206%29%20849-4108>
>
> On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwendell@gmail.com
> <mailto:pwendell@gmail.com>> wrote:
> I'll let TD chime on on this one, but I'm guessing this would be a welcome
> addition. It's great to see community effort on adding new
> streams/receivers, adding a Java API for receivers was something we did
> specifically to allow this :)
>
> - Patrick
>
> On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
> dibyendu.bhattachary@gmail.com<mailto:dibyendu.bhattachary@gmail.com>>
> wrote:
> Hi,
>
> I have implemented a Low Level Kafka Consumer for Spark Streaming using
> Kafka Simple Consumer API. This API will give better control over the Kafka
> offset management and recovery from failures. As the present Spark
> KafkaUtils uses HighLevel Kafka Consumer API, I wanted to have a better
> control over the offset management which is not possible in Kafka HighLevel
> consumer.
>
> This Project is available in below Repo :
>
> https://github.com/dibbhatt/kafka-spark-consumer
>
>
> I have implemented a Custom Receiver consumer.kafka.client.KafkaReceiver.
> The KafkaReceiver uses low level Kafka Consumer API (implemented in
> consumer.kafka packages) to fetch messages from Kafka and 'store' it in
> Spark.
>
> The logic will detect number of partitions for a topic and spawn that many
> threads (Individual instances of Consumers). Kafka Consumer uses Zookeeper
> for storing the latest offset for individual partitions, which will help to
> recover in case of failure. The Kafka Consumer logic is tolerant to ZK
> Failures, Kafka Leader of Partition changes, Kafka broker failures,
>  recovery from offset errors and other fail-over aspects.
>
> The consumer.kafka.client.Consumer is the sample Consumer which uses this
> Kafka Receivers to generate DStreams from Kafka and apply a Output
> operation for every messages of the RDD.
>
> We are planning to use this Kafka Spark Consumer to perform Near Real Time
> Indexing of Kafka Messages to target Search Cluster and also Near Real Time
> Aggregation using target NoSQL storage.
>
> Kindly let me know your view. Also if this looks good, can I contribute to
> Spark Streaming project.
>
> Regards,
> Dibyendu
>
>
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message