hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject On storing HBase data in AWS S3
Date Wed, 07 Oct 2009 21:21:02 GMT
Hi users,

I've recently helped debug a 0.19 HBase setup that was using S3 as
its DFS (one of the problems is discussed in another thread), and I
think I've gathered enough information to guide new users on whether
this is a worthwhile solution.

Short answer: don't use it for user-facing apps, consider it for
elastic EC2 clusters.

Long answer:

The main reason you would want to store your data in S3 is the
marketed high availability and infinite scalability. As the website
says: "It gives any developer access to the same highly scalable,
reliable, fast, inexpensive data storage infrastructure that Amazon
uses to run its own global network of web sites. The service aims to
maximize benefits of scale and to pass those benefits on to
developers." BTW I don't refute any of this; in my experience it has
been mostly true.

HBase can use any filesystem supported by Hadoop, including S3, so it
seems like a no-brainer to use it instead of having to set up HDFS.
Yes indeed, but...
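For reference, pointing HBase at S3 is mostly a matter of setting
hbase.rootdir to an s3:// URI and supplying AWS credentials through
the standard fs.s3.* properties. A minimal hbase-site.xml sketch; the
bucket name and credential values are placeholders, not values from
this setup:

```xml
<!-- hbase-site.xml: minimal sketch for an S3-backed rootdir.
     "my-hbase-bucket" and the credential values are placeholders. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>s3://my-hbase-bucket/hbase</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

Note that s3:// here is Hadoop's block-based S3 filesystem; files
stored through it are not readable by other S3 tools, unlike the
native s3n:// scheme.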

- You absolutely have to deploy your region servers in EC2, because of
the latency and bandwidth cost that every filesystem access would
otherwise incur.
- The way the S3 code works in Hadoop, every inbound and outbound
file is first buffered on local disk. Apart from slowing every
operation down even further, if you didn't change hadoop.tmp.dir it
will write to /tmp, and that volume on EC2 is always very small. In
fact, the first thing I had to debug was a "No space left on device"
error, which seems weird since S3 should have infinite storage, but
the error was really thrown while data was being buffered in the tmp
folder.
- There are some unknown interactions, because HBase has a very
different file usage pattern from MapReduce jobs and was optimized for
HDFS, not for distant networked storage.
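About the /tmp problem in the second point: the local buffering
directory is controlled by hadoop.tmp.dir, so one workaround is to
point it at the large instance-store volume that EC2 mounts under
/mnt. A sketch of the relevant hadoop-site.xml property (the exact
path is an assumption, pick any directory on the big volume):

```xml
<!-- hadoop-site.xml (0.19-era): move S3 buffering off the tiny
     root volume. /mnt/hadoop-tmp is a placeholder path on the
     large EC2 instance-store volume. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop-tmp</value>
</property>
```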

So if you need speed, simply don't use S3 with HBase; it will be too
slow. You can consider it for elastic MapReduce jobs, the same way
people use S3 with Hadoop, since you don't have to keep all the nodes
up all the time.

