spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shao, Saisai" <saisai.s...@intel.com>
Subject RE: Low Level Kafka Consumer for Spark
Date Tue, 05 Aug 2014 14:31:57 GMT
Hi,

I think this is an awesome feature for Spark Streaming Kafka interface to offer user the controllability
of partition offset, so user can have more applications based on this.

What I concern is that if we want to do offset management, fault tolerant related control
and others, we have to take the role as current ZookeeperConsumerConnect did, that would be
a big field we should take care of, for example when node is failed, how to pass current partition
to another consumer and some others. I’m not sure what is your thought?

Thanks
Jerry

From: Dibyendu Bhattacharya [mailto:dibyendu.bhattachary@gmail.com]
Sent: Tuesday, August 05, 2014 5:15 PM
To: Jonathan Hodges; dev@spark.apache.org
Cc: user
Subject: Re: Low Level Kafka Consumer for Spark

Thanks Jonathan,

Yes, till non-ZK based offset management is available in Kafka, I need to maintain the offset
in ZK. And yes, both cases explicit commit is necessary. I modified the Low Level Kafka Spark
Consumer little bit to have Receiver spawns threads for every partition of the topic and perform
the 'store' operation in multiple threads. It would be good if the receiver.store methods
are made thread safe..which is not now presently .

Waiting for TD's comment on this Kafka Spark Low Level consumer.


Regards,
Dibyendu


On Tue, Aug 5, 2014 at 5:32 AM, Jonathan Hodges <hodgesz@gmail.com<mailto:hodgesz@gmail.com>>
wrote:
Hi Yan,

That is a good suggestion.  I believe non-Zookeeper offset management will be a feature in
the upcoming Kafka 0.8.2 release tentatively scheduled for September.

https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management

That should make this fairly easy to implement, but it will still require explicit offset
commits to avoid data loss which is different than the current KafkaUtils implementation.

Jonathan




On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang724@gmail.com<mailto:yanfang724@gmail.com>>
wrote:
Another suggestion that may help is that, you can consider use Kafka to store the latest offset
instead of Zookeeper. There are at least two benefits: 1) lower the workload of ZK 2) support
replay from certain offset. This is how Samza<http://samza.incubator.apache.org/> deals
with the Kafka offset, the doc is here<http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html>
. Thank you.

Cheers,

Fang, Yan
yanfang724@gmail.com<mailto:yanfang724@gmail.com>
+1 (206) 849-4108<tel:%2B1%20%28206%29%20849-4108>

On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwendell@gmail.com<mailto:pwendell@gmail.com>>
wrote:
I'll let TD chime on on this one, but I'm guessing this would be a welcome addition. It's
great to see community effort on adding new streams/receivers, adding a Java API for receivers
was something we did specifically to allow this :)

- Patrick

On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <dibyendu.bhattachary@gmail.com<mailto:dibyendu.bhattachary@gmail.com>>
wrote:
Hi,

I have implemented a Low Level Kafka Consumer for Spark Streaming using Kafka Simple Consumer
API. This API will give better control over the Kafka offset management and recovery from
failures. As the present Spark KafkaUtils uses HighLevel Kafka Consumer API, I wanted to have
a better control over the offset management which is not possible in Kafka HighLevel consumer.

This Project is available in below Repo :

https://github.com/dibbhatt/kafka-spark-consumer


I have implemented a Custom Receiver consumer.kafka.client.KafkaReceiver. The KafkaReceiver
uses low level Kafka Consumer API (implemented in consumer.kafka packages) to fetch messages
from Kafka and 'store' it in Spark.

The logic will detect number of partitions for a topic and spawn that many threads (Individual
instances of Consumers). Kafka Consumer uses Zookeeper for storing the latest offset for individual
partitions, which will help to recover in case of failure. The Kafka Consumer logic is tolerant
to ZK Failures, Kafka Leader of Partition changes, Kafka broker failures,  recovery from offset
errors and other fail-over aspects.

The consumer.kafka.client.Consumer is the sample Consumer which uses this Kafka Receivers
to generate DStreams from Kafka and apply a Output operation for every messages of the RDD.

We are planning to use this Kafka Spark Consumer to perform Near Real Time Indexing of Kafka
Messages to target Search Cluster and also Near Real Time Aggregation using target NoSQL storage.

Kindly let me know your view. Also if this looks good, can I contribute to Spark Streaming
project.

Regards,
Dibyendu




Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message