kafka-users mailing list archives

From "Sybrandy, Casey" <Casey.Sybra...@Six3Systems.com>
Subject RE: Hadoop Consumer
Date Tue, 03 Jul 2012 16:34:48 GMT
>> - Is there a version of consumer which appends to an existing file on HDFS
>> until it reaches a specific size?
>>
>
>No there isn't, as far as I know. Potential solutions to this would be:
>
>   1. Leave the data in the broker long enough for it to reach the size you
>   want. Running the SimpleKafkaETLJob at those intervals would give you the
>   file size you want. This is the simplest thing to do, but the drawback is
>   that your data in HDFS will be less real-time.
>   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up
>   / compact your small files into one bigger file. You would need to come up
>   with the hadoop job that does the roll up, or find one somewhere.
>   3. Don't use the SimpleKafkaETLJob at all and write a new job that makes
>   use of hadoop append instead...
>
>Also, you may be interested to take a look at these scripts
><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>that I posted a while ago. If you follow the links in this post, you can get
>more details about how the scripts work and why it was necessary to do the
>things they do... or you can just use them without reading. They should
>work pretty much out of the box...

Where I work, we discovered that you can keep a file in HDFS open and still run MapReduce
jobs against the data in that file.  We flush the data periodically (after every record, in
our case) but don't close the file right away.  This lets us keep data files containing 24
hours' worth of data without having to close the file before running jobs, or schedule the
jobs for after the file is closed.  You can also check the file size periodically and rotate
the files based on size.  We use Avro files, but according to Cloudera, sequence files should
work too.
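
Roughly, the idea looks something like the sketch below, using the standard Hadoop FileSystem
API.  The class name, path layout, size threshold, and raw-byte records are placeholders (our
real writer uses Avro), and on older Hadoop releases the flush call is sync() rather than
hflush():

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a writer that keeps one HDFS file open, flushes after every
// record so that readers (e.g. MapReduce jobs) can see the data without the
// file being closed, and rotates to a new file once a size threshold is hit.
public class RotatingHdfsWriter {
    private static final long MAX_BYTES = 256L * 1024 * 1024; // example threshold

    private final FileSystem fs;
    private final Path dir;
    private FSDataOutputStream out;
    private int fileIndex = 0;

    public RotatingHdfsWriter(Configuration conf, Path dir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.dir = dir;
        openNextFile();
    }

    private void openNextFile() throws IOException {
        Path p = new Path(dir, String.format("data-%05d.log", fileIndex++));
        out = fs.create(p, false);
    }

    public void write(byte[] record) throws IOException {
        out.write(record);
        out.write('\n');
        // hflush() makes the data visible to new readers while the file stays open.
        out.hflush();
        // Rotate based on how much has been written, not on wall-clock time.
        if (out.getPos() >= MAX_BYTES) {
            out.close();
            openNextFile();
        }
    }

    public void close() throws IOException {
        out.close();
    }
}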

It's a great compromise for when you want the latest and greatest data, but don't want to
have to wait until all of the files are closed to get it.

Casey