spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com>
Subject Re: Low Level Kafka Consumer for Spark
Date Mon, 25 Aug 2014 05:39:49 GMT
Hi,

Just to give you some update on this low level consumer  (
https://github.com/dibbhatt/kafka-spark-consumer), we at Pearson have been
doing good amount load and stress testing for last few weeks and initial
test results are very impressive. We did not see any data loss, no issue
related to "Out of Memory"  (as like existing consumer) with heavy load and
no issues related to Kafka offsets managements .

Regards,
Dibyendu


On Thu, Aug 7, 2014 at 4:46 AM, Tathagata Das <tathagata.das1565@gmail.com>
wrote:

> Hi Dibyendu,
>
> This is really awesome. I am still yet to go through the code to
> understand the details, but I want to do it really soon. In particular, I
> want to understand the improvements, over the existing Kafka receiver.
>
> And its fantastic to see such contributions from the community. :)
>
> TD
>
>
> On Tue, Aug 5, 2014 at 8:38 AM, Dibyendu Bhattacharya <
> dibyendu.bhattachary@gmail.com> wrote:
>
>> Hi
>>
>> This fault tolerant aspect already taken care in the Kafka-Spark Consumer
>> code , like if Leader of a partition changes etc.. in ZkCoordinator.java.
>> Basically it does a refresh of PartitionManagers every X seconds to make
>> sure Partition details is correct and consumer don't fail.
>>
>> Dib
>>
>>
>> On Tue, Aug 5, 2014 at 8:01 PM, Shao, Saisai <saisai.shao@intel.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I think this is an awesome feature for Spark Streaming Kafka interface
>> to
>> > offer user the controllability of partition offset, so user can have
>> more
>> > applications based on this.
>> >
>> > What I concern is that if we want to do offset management, fault
>> tolerant
>> > related control and others, we have to take the role as current
>> > ZookeeperConsumerConnect did, that would be a big field we should take
>> care
>> > of, for example when node is failed, how to pass current partition to
>> > another consumer and some others. I’m not sure what is your thought?
>> >
>> > Thanks
>> > Jerry
>> >
>> > From: Dibyendu Bhattacharya [mailto:dibyendu.bhattachary@gmail.com]
>> > Sent: Tuesday, August 05, 2014 5:15 PM
>> > To: Jonathan Hodges; dev@spark.apache.org
>> > Cc: user
>> > Subject: Re: Low Level Kafka Consumer for Spark
>> >
>> > Thanks Jonathan,
>> >
>> > Yes, till non-ZK based offset management is available in Kafka, I need
>> to
>> > maintain the offset in ZK. And yes, both cases explicit commit is
>> > necessary. I modified the Low Level Kafka Spark Consumer little bit to
>> have
>> > Receiver spawns threads for every partition of the topic and perform the
>> > 'store' operation in multiple threads. It would be good if the
>> > receiver.store methods are made thread safe..which is not now presently
>> .
>> >
>> > Waiting for TD's comment on this Kafka Spark Low Level consumer.
>> >
>> >
>> > Regards,
>> > Dibyendu
>> >
>> >
>> > On Tue, Aug 5, 2014 at 5:32 AM, Jonathan Hodges <hodgesz@gmail.com
>> <mailto:
>> > hodgesz@gmail.com>> wrote:
>> > Hi Yan,
>> >
>> > That is a good suggestion.  I believe non-Zookeeper offset management
>> will
>> > be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled
>> for
>> > September.
>> >
>> >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management
>> >
>> > That should make this fairly easy to implement, but it will still
>> require
>> > explicit offset commits to avoid data loss which is different than the
>> > current KafkaUtils implementation.
>> >
>> > Jonathan
>> >
>> >
>> >
>> >
>> > On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang724@gmail.com<mailto:
>> > yanfang724@gmail.com>> wrote:
>> > Another suggestion that may help is that, you can consider use Kafka to
>> > store the latest offset instead of Zookeeper. There are at least two
>> > benefits: 1) lower the workload of ZK 2) support replay from certain
>> > offset. This is how Samza<http://samza.incubator.apache.org/> deals
>> with
>> > the Kafka offset, the doc is here<
>> >
>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html
>> >
>> > . Thank you.
>> >
>> > Cheers,
>> >
>> > Fang, Yan
>> > yanfang724@gmail.com<mailto:yanfang724@gmail.com>
>> > +1 (206) 849-4108<tel:%2B1%20%28206%29%20849-4108>
>> >
>> > On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwendell@gmail.com
>> > <mailto:pwendell@gmail.com>> wrote:
>> > I'll let TD chime on on this one, but I'm guessing this would be a
>> welcome
>> > addition. It's great to see community effort on adding new
>> > streams/receivers, adding a Java API for receivers was something we did
>> > specifically to allow this :)
>> >
>> > - Patrick
>> >
>> > On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
>> > dibyendu.bhattachary@gmail.com<mailto:dibyendu.bhattachary@gmail.com>>
>> > wrote:
>> > Hi,
>> >
>> > I have implemented a Low Level Kafka Consumer for Spark Streaming using
>> > Kafka Simple Consumer API. This API will give better control over the
>> Kafka
>> > offset management and recovery from failures. As the present Spark
>> > KafkaUtils uses HighLevel Kafka Consumer API, I wanted to have a better
>> > control over the offset management which is not possible in Kafka
>> HighLevel
>> > consumer.
>> >
>> > This Project is available in below Repo :
>> >
>> > https://github.com/dibbhatt/kafka-spark-consumer
>> >
>> >
>> > I have implemented a Custom Receiver
>> consumer.kafka.client.KafkaReceiver.
>> > The KafkaReceiver uses low level Kafka Consumer API (implemented in
>> > consumer.kafka packages) to fetch messages from Kafka and 'store' it in
>> > Spark.
>> >
>> > The logic will detect number of partitions for a topic and spawn that
>> many
>> > threads (Individual instances of Consumers). Kafka Consumer uses
>> Zookeeper
>> > for storing the latest offset for individual partitions, which will
>> help to
>> > recover in case of failure. The Kafka Consumer logic is tolerant to ZK
>> > Failures, Kafka Leader of Partition changes, Kafka broker failures,
>> >  recovery from offset errors and other fail-over aspects.
>> >
>> > The consumer.kafka.client.Consumer is the sample Consumer which uses
>> this
>> > Kafka Receivers to generate DStreams from Kafka and apply a Output
>> > operation for every messages of the RDD.
>> >
>> > We are planning to use this Kafka Spark Consumer to perform Near Real
>> Time
>> > Indexing of Kafka Messages to target Search Cluster and also Near Real
>> Time
>> > Aggregation using target NoSQL storage.
>> >
>> > Kindly let me know your view. Also if this looks good, can I contribute
>> to
>> > Spark Streaming project.
>> >
>> > Regards,
>> > Dibyendu
>> >
>> >
>> >
>> >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message