kafka-users mailing list archives

From Grégoire Seux <g.s...@criteo.com>
Subject RE: Hadoop Consumer
Date Wed, 04 Jul 2012 16:25:07 GMT
Thanks a lot Min, this is indeed very useful. 

-- 
Greg

-----Original Message-----
From: Felix GV [mailto:felix@mate1inc.com] 
Sent: Wednesday, July 4, 2012 18:19
To: kafka-users@incubator.apache.org
Subject: Re: Hadoop Consumer

Thanks for the info, that's interesting :) ...

And thanks for the link Min :) Having a hadoop consumer that manages the offsets with ZK is
cool :) ...

--
Felix



On Wed, Jul 4, 2012 at 9:04 AM, Sybrandy, Casey <Casey.Sybrandy@six3systems.com> wrote:

> We're using CDH3 update 2 or 3.  I don't know how much the version 
> matters, so it may work on plain-old Hadoop.
> _____________________
> From: Murtaza Doctor [murtaza@richrelevance.com]
> Sent: Tuesday, July 03, 2012 1:56 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: Hadoop Consumer
>
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <felix@mate1inc.com> wrote:
>
> >Hmm that's surprising. I didn't know about that...!
> >
> >I wonder if it's a new feature... Judging from your email, I assume 
> >you're using CDH? What version?
> >
> >Interesting :) ...
> >
> >--
> >Felix
> >
> >
> >
> >On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey
> ><Casey.Sybrandy@six3systems.com> wrote:
> >
> >> >> - Is there a version of consumer which appends to an existing file
> >> >> on HDFS until it reaches a specific size?
> >> >>
> >> >
> >> >No there isn't, as far as I know. Potential solutions to this would be:
> >> >
> >> >   1. Leave the data in the broker long enough for it to reach the size
> >> >   you want. Running the SimpleKafkaETLJob at those intervals would give
> >> >   you the file size you want. This is the simplest thing to do, but the
> >> >   drawback is that your data in HDFS will be less real-time.
> >> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
> >> >   up / compact your small files into one bigger file. You would need to
> >> >   come up with the hadoop job that does the roll up, or find one
> >> >   somewhere.
> >> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >> >   makes use of hadoop append instead...
> >> >
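For what it's worth, a rough, illustrative sketch of what option 3 above could look like: open an existing HDFS file for append and rotate it once it reaches a target size. This is only a sketch, assuming a Hadoop build with append support enabled (dfs.support.append); the path, the record loop and the AppendingHdfsWriter class are placeholders and not part of the SimpleKafkaETLJob.

    // Sketch of option 3: append new Kafka messages to an existing HDFS file
    // instead of creating a new file per run. Assumes append support is
    // enabled on the cluster; path, records and size limit are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendingHdfsWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path target = new Path("/data/kafka/topic-a/current"); // hypothetical path

            // Open for append if the file exists, otherwise create it.
            FSDataOutputStream out = fs.exists(target)
                    ? fs.append(target)
                    : fs.create(target);

            // In a real job this loop would pull messages from a Kafka consumer.
            for (String record : new String[] { "msg-1", "msg-2" }) {
                out.write((record + "\n").getBytes("UTF-8"));
            }
            out.close();

            // Rotate once the file reaches the desired size.
            long maxBytes = 1024L * 1024L * 1024L; // e.g. 1 GB
            if (fs.getFileStatus(target).getLen() >= maxBytes) {
                fs.rename(target, new Path(target + "." + System.currentTimeMillis()));
            }
        }
    }
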
> >> >Also, you may be interested to take a look at these scripts
> >> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> >> >I posted a while ago. If you follow the links in this post, you can get
> >> >more details about how the scripts work and why it was necessary to do
> >> >the things it does... or you can just use them without reading. They
> >> >should work pretty much out of the box...
> >>
> >> Where I work, we discovered that you can keep a file in HDFS open and
> >> still run MapReduce jobs against the data in that file. What you do is
> >> you flush the data periodically (every record for us), but you don't
> >> close the file right away. This allows us to have data files that
> >> contain 24 hours worth of data, but not have to close the file to run
> >> the jobs or to schedule the jobs for after the file is closed. You can
> >> also check the file size periodically and rotate the files based on
> >> size. We use Avro files, but sequence files should work too according
> >> to Cloudera.
> >>
> >> It's a great compromise for when you want the latest and greatest data,
> >> but don't want to have to wait until all of the files are closed to
> >> get it.
> >>
> >> Casey
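
For reference, a rough sketch of the pattern Casey describes above: keep one HDFS file open, flush after each record so readers and MapReduce jobs can see the data without waiting for close(), and rotate by size. The class name, path and size threshold are illustrative only; newer Hadoop clients expose hflush() on the output stream, while 0.20/CDH3-era clients used sync() for the same purpose.

    // Sketch of the keep-the-file-open approach: flush per record, roll by size.
    // Path, rotation threshold and record source are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OpenFileWriter {
        private static final long MAX_BYTES = 512L * 1024 * 1024; // rotate at ~512 MB

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path current = new Path("/data/events/current"); // hypothetical path
            FSDataOutputStream out = fs.create(current);

            // Stand-in for a stream of incoming records.
            for (String record : new String[] { "event-1", "event-2" }) {
                out.write((record + "\n").getBytes("UTF-8"));

                // Flush after every record (as in Casey's setup) so the data is
                // visible to readers even though the file stays open.
                out.hflush(); // on 0.20/CDH3 clients this call was out.sync()

                // Check the size periodically and roll to a new file if needed.
                if (out.getPos() >= MAX_BYTES) {
                    out.close();
                    fs.rename(current, new Path(current + "." + System.currentTimeMillis()));
                    out = fs.create(current);
                }
            }
            out.close();
        }
    }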
>
>
