kafka-users mailing list archives

From Paul Ingles <p...@forward.co.uk>
Subject Kafka and Hadoop
Date Thu, 11 Aug 2011 11:20:20 GMT

I'm evaluating using Kafka to aggregate some web logs and additional activity tracking for
one of our projects. I'd like to know a little more about the best way to stitch things together.

The application runs across EC2 and some internal hardware, and we also run a Hadoop cluster inside
our office. I'd like to use Kafka to aggregate activity, augment it with something like Esper
for some systems-monitoring work, and pull data down to our Hadoop cluster (and ultimately into
some Hive tables) for offline analysis as well.

I notice in the hadoop-consumer README (https://github.com/kafka-dev/kafka/tree/master/contrib/hadoop-consumer)
it's necessary to provide the HDFS location of the input files.

I was wondering whether people had recommendations on good ways to pull data onto HDFS. My
current thinking is to periodically replicate the topics' data onto S3 and then run distcp
to copy it onto HDFS. Does that sound reasonable?
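For what it's worth, I imagine the S3-to-HDFS leg would look something like the sketch below. The bucket name, paths, and namenode address are all placeholders rather than anything from our actual setup, and I'm assuming the s3n:// filesystem scheme for reading from S3:

```shell
# Hypothetical periodic sync (e.g. from cron): copy Kafka log files
# that have been replicated to S3 into HDFS for the Hive tables.
# -update skips files whose size already matches at the destination,
# so re-running the job only transfers newly-arrived files.
hadoop distcp \
  -update \
  s3n://example-bucket/kafka-logs/ \
  hdfs://namenode:8020/data/kafka-logs/
```

The -update flag is what would make it safe to run repeatedly on a schedule, though I'd be glad to hear if there's a more idiomatic approach.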

Thanks for any tips,