chukwa-user mailing list archives

From: Ken White <ken.wh...@mailcatch.com>
Subject: chukwa suitability for collecting data from multiple datanodes to HDFS
Date: Mon, 29 Jun 2009 22:20:43 GMT
Hi all!

I am trying to use Hadoop to process a large amount of data (ad network 
statistics). The data is gathered on multiple nodes. Some of it is precious - 
it must not be lost before it is processed. Since HDFS doesn't handle many 
small file uploads well, we are looking for a reliable (fault-tolerant) way 
to upload this data to HDFS as it is generated.
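
To make it concrete, here is roughly what the naive approach we want to avoid 
looks like - one tiny HDFS file per batch, written with the standard Hadoop 
FileSystem API (all paths and names below are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NaiveBatchUpload {
        public static void main(String[] args) throws Exception {
            // Reads fs.default.name etc. from the usual Hadoop config files.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One small file per batch of records: nothing is lost once
            // close() returns, but thousands of tiny files per node per day
            // is exactly the workload HDFS handles badly.
            Path out = new Path("/adstats/node42/batch-"
                    + System.currentTimeMillis() + ".log");
            FSDataOutputStream stream = fs.create(out);
            stream.writeBytes("impression\t2009-06-29T22:19:04\tcampaign=1234\n");
            stream.close();
        }
    }

Doing that every few seconds from dozens of nodes obviously won't scale on 
the namenode side.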

If I understand correctly, Chukwa does just that? Its main purpose is 
different, but it collects data from multiple nodes and writes it to HDFS, 
which is basically the same.
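
If I read the agent documentation correctly, the per-node setup would be 
roughly: run a local Chukwa agent and register a file-tailing adaptor over 
its control port (9093 by default). Something like the following - the 
datatype name and path are just examples, and please correct me if I have 
the syntax wrong, I am going from the docs:

    telnet localhost 9093
    add filetailer.FileTailingAdaptor AdStats /var/log/adstats/events.log 0

The collectors would then batch the streams from all agents into a few large 
sink files on HDFS, if I understand the architecture right.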

I was wondering what measures (if any) Chukwa takes to make sure no data is 
lost. (What happens if a collector dies - fire, flood, axe through the 
CPU, ...?) In other words, is reliability a primary concern for Chukwa, or 
is it not (because of its different usage)?

I would appreciate other ideas on how to handle small incremental data 
uploads too, of course. I am a bit new to this field, but I guess I am not 
the first one to have this kind of problem. :)

Thank you!

Kind regards,

Ken

