kafka-users mailing list archives

From Murtaza Doctor <murt...@richrelevance.com>
Subject Re: Hadoop Consumer
Date Tue, 03 Jul 2012 17:56:13 GMT
>>
>>- We have event data under the topic "foo" written to the Kafka
>> server/broker in Avro format and want to write those events to HDFS.
>> Does the Hadoop consumer expect the data to be written to HDFS already?
>
>
>No, it doesn't expect the data to be written into HDFS already... there
>wouldn't be much point to it otherwise, no? ;)
>

Sorry, my note was unclear. I meant that the SimpleKafkaETLJob requires a
sequence file with an offset written to HDFS, and then uses that as a
bookmark to pull the data from the broker.
That file has a checksum, and I was trying to modify the topic in it, which
of course corrupts the checksum. I already have events generated on my
Kafka server, and all I wanted to do is run the SimpleKafkaETLJob to pull
out the data and write it to HDFS. I was trying to fulfill that
sequence-file prerequisite, and that does not seem to work for me.
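
In case it helps anyone else hitting the same checksum problem: the .crc
side file is recomputed whenever the sequence file is rewritten through the
Hadoop API, so regenerating the bookmark programmatically (rather than
editing it in place) gets around the corruption. Below is a rough sketch
using generic Writable types and made-up paths; the contrib job actually
expects the key/value classes that DataGenerator writes, so treat it only
as an illustration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rewrites the offset "bookmark" as a fresh SequenceFile so Hadoop
// recomputes the .crc side file, instead of hand-editing the bytes.
// NOTE: Text/LongWritable are placeholder types; the contrib
// SimpleKafkaETLJob expects the record classes that DataGenerator
// writes, so adapt the key/value types to match your setup.
public class WriteOffsetBookmark {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // made-up URI
    Path bookmark = new Path("/tmp/kafka/input/offsets.seq");                 // made-up path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, bookmark, Text.class, LongWritable.class);
    try {
      // One record per topic/partition: start from offset 0
      // (or whatever "earliest" means for your consumer).
      writer.append(new Text("foo"), new LongWritable(0L));
    } finally {
      writer.close();
    }
  }
}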

>
>> Based on the
>> doc looks like the DataGenerator is pulling events from the broker and
>> writing to HDFS. In our case we only wanted to utilize the
>> SimpleKafkaETLJob to write to HDFS.
>
>
>That's what it does. It spawns a map-only MapReduce job that pulls in
>parallel from the broker(s) and writes that data into HDFS.
>
>
>> I am surely missing something here?
>>
>
>Maybe...? I don't know. Do tell if anything is still unclear...!

Thanks for confirming; I just want to make sure I got it right.
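
To spell out the mental model: a map-only Hadoop job is just an ordinary
job with zero reducers, so each map task can do its own fetch and its
output lands directly in HDFS with no shuffle in between. A bare-bones
sketch of that shape (not the actual SimpleKafkaETLJob; class names, input
types and paths here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MapOnlyPullJob {

  // In the real contrib job the mapper would take a broker/topic/offset
  // record as input and fetch the messages from Kafka; this identity
  // mapper just illustrates the map-only structure.
  public static class PullMapper
      extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {
    @Override
    protected void map(LongWritable key, BytesWritable value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "map-only kafka pull (sketch)");
    job.setJarByClass(MapOnlyPullJob.class);
    job.setMapperClass(PullMapper.class);
    job.setNumReduceTasks(0);                       // map-only: no shuffle, no reduce
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(BytesWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // sequence files of <LongWritable, BytesWritable>
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}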

>
>
>> - Is there a version of the consumer which appends to an existing file
>> on HDFS until it reaches a specific size?
>>
>
>No there isn't, as far as I know. Potential solutions to this would be:
>
>   1. Leave the data in the broker long enough for it to reach the size
>   you want. Running the SimpleKafkaETLJob at those intervals would give
>   you the file size you want. This is the simplest thing to do, but the
>   drawback is that your data in HDFS will be less real-time.
>   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>   up / compact your small files into one bigger file. You would need to
>   come up with the Hadoop job that does the roll-up, or find one
>   somewhere.
>   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>   makes use of Hadoop append instead...

These options are very useful. I like option 3 the most :)
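
For the record, my rough idea of the append route is sketched below. It
assumes the cluster has dfs.support.append enabled (still a somewhat
experimental feature in the Hadoop releases we run), and the URI, paths and
roll threshold are made-up values; the framing of the Avro-encoded messages
is left out entirely.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Appends each fetched batch to a single growing HDFS file and rolls to a
// new file once it passes a target size. Proper container/framing (e.g.
// valid Avro container files) is out of scope for this sketch.
public class AppendingSink {
  private static final long ROLL_SIZE = 1024L * 1024L * 1024L; // ~1 GB, made-up threshold

  public static void writeBatch(byte[] batch) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // made-up URI
    Path current = new Path("/data/foo/current.dat");                         // made-up path

    FSDataOutputStream out;
    if (fs.exists(current) && fs.getFileStatus(current).getLen() < ROLL_SIZE) {
      out = fs.append(current);   // keep growing the same file
    } else {
      if (fs.exists(current)) {
        // Roll: move the full file aside and start a fresh one.
        fs.rename(current, new Path("/data/foo/closed-" + System.currentTimeMillis() + ".dat"));
      }
      out = fs.create(current);
    }
    try {
      out.write(batch);
    } finally {
      out.close();
    }
  }
}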

>
>Also, you may be interested to take a look at these scripts I posted a
>while ago:
>http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
>If you follow the links in that post, you can get more details about how
>the scripts work and why it was necessary to do the things they do... or
>you can just use them without reading. They should work pretty much out
>of the box...

Will surely give them a spin. Thanks!
>
>>
>> Thanks,
>> murtaza
>>
>>

