kafka-users mailing list archives

From David Arthur <mum...@gmail.com>
Subject Re: Reading Kafka directly from Pig?
Date Wed, 07 Aug 2013 15:19:38 GMT
Right now it only terminates if SimpleConsumer hits the timeout, so in 
theory it can run forever. To bound the InputFormat, I would probably add a 
max time or max number of messages to consume (in addition to the timeout).
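
As a rough illustration of that bounding idea (not the prototype's actual code), here is a minimal sketch that wraps any message iterator and stops after a maximum message count or a maximum elapsed time, whichever comes first. The class name BoundedIterator and its parameters are hypothetical, and the Kafka fetch itself is left out: the sketch assumes the messages are already exposed as a plain java.util.Iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical wrapper: ends iteration after maxMessages items or
 * maxMillis of wall-clock time, whichever limit is reached first.
 */
final class BoundedIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final long maxMessages;
    private final long deadlineNanos;
    private long consumed = 0;

    BoundedIterator(Iterator<T> source, long maxMessages, long maxMillis) {
        this.source = source;
        this.maxMessages = maxMessages;
        this.deadlineNanos = System.nanoTime() + maxMillis * 1_000_000L;
    }

    @Override
    public boolean hasNext() {
        if (consumed >= maxMessages) {
            return false; // message budget used up
        }
        if (System.nanoTime() - deadlineNanos >= 0) {
            return false; // time budget used up
        }
        return source.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        consumed++;
        return source.next();
    }
}
```

The time limit here is only checked between messages, so a source that blocks inside hasNext() would still depend on the consumer timeout to return.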

I started by looking at the Camus code, but it was easier to whip up a 
simple InputFormat for testing. If this becomes a real thing, I would 
probably figure out how to utilize Camus.


On 8/7/13 10:49 AM, Jun Rao wrote:
> David,
> That's interesting. Kafka provides an infinite stream of data whereas Pig
> works on a finite amount of data. How did you solve the mismatch?
> Thanks,
> Jun
> On Wed, Aug 7, 2013 at 7:41 AM, David Arthur <mumrah@gmail.com> wrote:
>> I've thrown together a Pig LoadFunc to read data from Kafka, so you could
>> load data like:
>> QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using
>> com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');
>> The path part of the uri is the Kafka topic, and the fragment is the
>> number of partitions. In the implementation I have, it makes one input
>> split per partition. Offsets are not really dealt with at this point - it's
>> a rough prototype.
>> Anyone have thoughts on whether or not this is a good idea? I know usually
>> the pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this
>> data from Kafka once, is there any reason why I can't skip writing to HDFS?
>> Thanks!
>> -David
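
The URI convention described in the quoted message (topic in the path, partition count in the fragment, one input split per partition) could be parsed along these lines. This is only a sketch using java.net.URI; the class KafkaLocation and its members are hypothetical names, not part of the prototype, and the real loader may handle the location differently.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for a location such as kafka://host:port/topic#partitions. */
final class KafkaLocation {
    final String host;
    final int port;
    final String topic;
    final int partitions;

    private KafkaLocation(String host, int port, String topic, int partitions) {
        this.host = host;
        this.port = port;
        this.topic = topic;
        this.partitions = partitions;
    }

    static KafkaLocation parse(String location) {
        URI uri = URI.create(location);
        if (!"kafka".equals(uri.getScheme())) {
            throw new IllegalArgumentException("expected a kafka:// URI: " + location);
        }
        String path = uri.getPath();         // e.g. "/logs.query"
        String fragment = uri.getFragment(); // e.g. "8"
        if (path == null || path.length() < 2 || fragment == null) {
            throw new IllegalArgumentException("expected kafka://host:port/topic#partitions");
        }
        // Strip the leading slash to get the topic name.
        return new KafkaLocation(uri.getHost(), uri.getPort(),
                path.substring(1), Integer.parseInt(fragment));
    }

    /** One entry per partition, mirroring the one-split-per-partition idea. */
    List<Integer> partitionIds() {
        List<Integer> ids = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            ids.add(p);
        }
        return ids;
    }
}
```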
