spark-dev mailing list archives

From Yan Fang <>
Subject Re: Low Level Kafka Consumer for Spark
Date Mon, 04 Aug 2014 22:51:33 GMT
Another suggestion that may help: you can consider using Kafka itself to
store the latest offset instead of ZooKeeper. There are at least two
benefits: 1) it lowers the workload on ZK, and 2) it supports replay from a
given offset. This is how Samza <> deals with
Kafka offsets; the doc is here
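A minimal sketch of the replay benefit, under stated assumptions: `OffsetStore` and `InMemoryOffsetStore` below are illustrative names made up for this sketch, not Samza or Kafka classes. A broker-backed implementation would issue Kafka's offset-commit protocol requests (OffsetCommitRequest/OffsetFetchRequest) instead of writing to ZooKeeper, but the resume logic is the same either way:

```scala
// Illustrative only: OffsetStore is a made-up abstraction standing in for
// wherever offsets are committed (ZooKeeper, or Kafka's offset-commit API).
trait OffsetStore {
  def commit(topic: String, partition: Int, offset: Long): Unit
  def fetch(topic: String, partition: Int): Option[Long]
}

// In-memory stand-in so the sketch is self-contained and runnable.
class InMemoryOffsetStore extends OffsetStore {
  private val offsets = scala.collection.mutable.Map.empty[(String, Int), Long]
  def commit(topic: String, partition: Int, offset: Long): Unit =
    synchronized { offsets((topic, partition)) = offset }
  def fetch(topic: String, partition: Int): Option[Long] =
    synchronized { offsets.get((topic, partition)) }
}

object Replay {
  // Resume one past the last committed offset, or from the beginning
  // if nothing was ever committed for this partition.
  def startingOffset(store: OffsetStore, topic: String, partition: Int): Long =
    store.fetch(topic, partition).map(_ + 1).getOrElse(0L)
}
```

On restart, `startingOffset` picks up exactly where the last commit left off, which is what makes replay from a chosen offset possible regardless of which backend holds the commits.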
Thank you.


Fang, Yan
+1 (206) 849-4108

On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <> wrote:

> I'll let TD chime in on this one, but I'm guessing this would be a welcome
> addition. It's great to see community effort on adding new
> streams/receivers; adding a Java API for receivers was something we did
> specifically to allow this :)
> - Patrick
> On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
>> wrote:
>> Hi,
>> I have implemented a low-level Kafka consumer for Spark Streaming using
>> the Kafka SimpleConsumer API. This API gives better control over Kafka
>> offset management and recovery from failures. Since the present Spark
>> KafkaUtils uses the high-level Kafka consumer API, I wanted better control
>> over offset management, which is not possible with the high-level
>> consumer.
>> This project is available in the repo below:
>> I have implemented a custom receiver, consumer.kafka.client.KafkaReceiver.
>> The KafkaReceiver uses the low-level Kafka consumer API (implemented in
>> the consumer.kafka packages) to fetch messages from Kafka and 'store'
>> them in Spark.
>> The logic detects the number of partitions for a topic and spawns that
>> many threads (individual consumer instances). The Kafka consumer uses
>> ZooKeeper to store the latest offset for each partition, which helps
>> recovery in case of failure. The consumer logic is tolerant to ZK
>> failures, Kafka partition-leader changes, Kafka broker failures, recovery
>> from offset errors, and other fail-over aspects.
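The thread-per-partition layout described above can be sketched as follows. `PartitionedConsumer` and the in-memory message map are illustrative stand-ins invented for this sketch (they are not classes from the repo); the real receiver would fetch via Kafka's SimpleConsumer, call the Spark receiver's store() where `handle` is invoked, and commit offsets to ZooKeeper:

```scala
import java.util.concurrent.ConcurrentHashMap

object PartitionedConsumer {
  // One thread per partition processes that partition's messages in order
  // and records the highest offset handled, so a restart can resume there.
  // `partitions` stands in for what would otherwise come from Kafka topic
  // metadata plus fetch requests; `handle` must be thread-safe.
  def consume(partitions: Map[Int, Seq[String]],
              handle: String => Unit): Map[Int, Long] = {
    val committed = new ConcurrentHashMap[Int, java.lang.Long]()
    val workers = partitions.toSeq.map { case (p, msgs) =>
      new Thread(() => {
        msgs.zipWithIndex.foreach { case (m, offset) =>
          handle(m)                        // e.g. Receiver.store(m) in Spark
          committed.put(p, offset.toLong)  // "commit" only after processing
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    // -1 marks a partition whose thread never processed a message.
    partitions.keys.map { p =>
      p -> Option(committed.get(p)).map(_.longValue).getOrElse(-1L)
    }.toMap
  }
}
```

Committing only after `handle` succeeds is what makes a restart safe: a crash mid-partition replays at most the in-flight message rather than skipping it.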
>> consumer.kafka.client.Consumer is a sample consumer that uses these Kafka
>> receivers to generate DStreams from Kafka and apply an output operation
>> to every message of the RDD.
>> We are planning to use this Kafka Spark consumer to perform near-real-time
>> indexing of Kafka messages into a target search cluster, and also
>> near-real-time aggregation into target NoSQL storage.
>> Kindly let me know your views. Also, if this looks good, may I contribute
>> it to the Spark Streaming project?
>> Regards,
>> Dibyendu
