chukwa-user mailing list archives

From Ariel Rabkin
Subject Re: chukwa suitability for collecting data from multiple datanodes to HDFS
Date Mon, 29 Jun 2009 22:47:51 GMT
Chukwa does indeed aim to solve the problem you have. Reliability is a
goal for us, but not our highest priority.  The current implementation
falls short in a few places. These are known bugs, and shouldn't be
hard to fix -- they just haven't been priorities.

As to reliability mechanisms:

1) Each chunk of sent data carries a sequence ID, so you can tell which
file it came from and which part of the file it is. This allows
after-the-fact detection of data loss or duplication.

2) Agents checkpoint themselves periodically, so if they crash, data
might be duplicated, but won't be lost.

3) There are a few windows in which a collector can crash before data
has been committed to stable storage. This is mostly a consequence of
HDFS not having flush() yet.  It is possible to hack around that.
Fixing this will require writing some code, but not any major
architectural change.
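
To make point 1 concrete, here is a rough sketch of how after-the-fact
detection could work. It is not Chukwa code. It assumes each chunk is
available as a (source, seq_id, length) tuple, that seq_id is the stream
offset just past the chunk's last byte, and that every stream starts at
offset 0. The function name audit_chunks is made up for illustration.

```python
from collections import defaultdict


def audit_chunks(chunks):
    """Report missing and duplicated byte ranges per source.

    chunks: iterable of (source, seq_id, length) tuples.
    Assumption (not confirmed Chukwa behaviour): seq_id is the stream
    offset just past the chunk's last byte, so a chunk covers the
    half-open byte range [seq_id - length, seq_id).
    """
    by_source = defaultdict(list)
    for source, seq_id, length in chunks:
        by_source[source].append((seq_id - length, seq_id))

    report = {}
    for source, spans in by_source.items():
        gaps, overlaps = [], []
        covered = 0  # assumed: every stream starts at offset 0
        for start, end in sorted(spans):
            if start > covered:
                # Bytes between what we have seen and this chunk are missing.
                gaps.append((covered, start))
            elif start < covered:
                # This chunk repeats bytes that were already delivered.
                overlaps.append((start, min(end, covered)))
            covered = max(covered, end)
        report[source] = {"gaps": gaps, "overlaps": overlaps}
    return report


if __name__ == "__main__":
    sample = [
        ("a.log", 100, 100),
        ("a.log", 200, 100),
        ("a.log", 200, 100),  # duplicate, e.g. resent after an agent restart
        ("a.log", 400, 100),  # bytes 200..300 never arrived
    ]
    print(audit_chunks(sample))
```

A gap means data was lost somewhere between agent and storage; an
overlap is the kind of duplication that point 2 describes, where an
agent restarts from its last checkpoint and resends data.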

On Mon, Jun 29, 2009 at 3:20 PM, Ken White wrote:
> Hi all!
> I am trying to use Hadoop for processing large amount of data (ad network
> statistics). This data is gathered on multiple nodes. Some of it is precious -
> it must not be lost before it is processed. Since HDFS doesn't work well with
> multiple small file uploads we are looking for a reliable (fault tolerant)
> solution to upload this data to HDFS as it is generated.
> If I understand correctly, Chukwa does just that? The main purpose is
> different, but it collects data from multiple nodes and writes it to HDFS,
> which is basically the same.

Ari Rabkin
UC Berkeley Computer Science Department
