kafka-users mailing list archives

From David Arthur <mum...@gmail.com>
Subject Re: Scenarios of Hadoop producers and consumers
Date Wed, 31 Oct 2012 01:05:31 GMT
Indeed Hadoop is not the ideal platform for stream processing, but there are plenty of use
cases for Kafka + Hadoop. I use it to consolidate log data from many different systems into
HDFS. I have N systems using the log4j appender producing to a Kafka broker, and then in my
Hadoop cluster I run a simple job that consumes that data and writes out an HDFS file. This,
in effect, is what other log aggregators like Flume do - however, we have Kafka in our stack
for other pub/sub stuff so it made sense to use it for log aggregation as well. 

To answer your question about consuming in Hadoop, the RecordReader will just continue to
return records until the queue is exhausted. If you could manage to produce data faster than
Hadoop was reading it out (very unlikely), the Hadoop job would run forever (or at least for
quite a while). I believe you end up with one RecordReader per Kafka partition, so allocating
more partitions would increase your throughput to Hadoop (at least until you saturate the
network between the Kafka brokers and Hadoop).
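
To make the "read until exhausted, one reader per partition" behaviour concrete, here is a rough Python sketch. It is only a toy model, not the actual Kafka or Hadoop code: PartitionReader, next_record and run_job are names I made up for illustration, and the partitions are plain in-memory lists standing in for Kafka partitions.

```python
from collections import deque


class PartitionReader:
    """Hypothetical stand-in for the RecordReader assigned to one partition."""

    def __init__(self, partition_id, messages):
        self.partition_id = partition_id
        self._queue = deque(messages)

    def next_record(self):
        # Hand back the next message, or None once the partition is drained.
        return self._queue.popleft() if self._queue else None


def run_job(partitions):
    """Create one reader per partition and read each until it is exhausted."""
    output = {}
    for partition_id, messages in partitions.items():
        reader = PartitionReader(partition_id, messages)
        records = []
        while (record := reader.next_record()) is not None:
            records.append(record)
        output[partition_id] = records
    return output


# Three toy partitions; the last one is empty, so its reader finishes at once.
partitions = {0: ["a", "b"], 1: ["c"], 2: []}
result = run_job(partitions)
```

In a real job the readers would run in parallel as separate map tasks rather than in a loop, which is why adding partitions adds throughput. A reader only stops when it sees an empty partition, so a producer that keeps appending faster than the reader drains would keep that loop running.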

Hope this helps

On Oct 30, 2012, at 8:40 PM, Michal Haris wrote:

> When you need your data streams to be incrementally loaded into hadoop for
> offline batch processing and/or ad-hoc querying - some things cannot (or
> are expensive to) be computed in real-time. So you have a hadoop job that
> consumes kafka stream, potentially formats the data and saves into hdfs.
> On 30 October 2012 23:28, Hussein Baghdadi <hubaghdadi@hotmail.com> wrote:
>> Hi,
>> Kafka comes with a support for Hadoop. I'm not sure what does this mean.
>> Kafka is a publish-subscribe messaging system. What are some of the
>> typical usage of Kafka-support for Hadoop producers and consumers?
>> Well, producers are easy to digest. MapReduce job emitting data to Kafka.
>> But what about Hadoop consumers?
>> Hadoop is a batching system, not a continuous running system (as Storm
>> or Dempsy). Say Kafka gets some data, what will happen?
>> Thanks for help and time.
> -- 
> Michal Haris
> Software Engineer
> www.visualdna.com | t: +44 (0) 207 734 7033
