hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject RE: Large webmail storage and Hbase
Date Wed, 19 Nov 2008 19:14:20 GMT
Edward,

We have a user-facing website backed fully by HBase.

Like Joost, we do a lot of random reads, and so far the out-of-the-box
random-read performance of HBase has not been sufficient for us.  We solve
this with a system very similar to memcached.  We also have external
indexes to deal with sorting, secondary indexing, etc.

Blockcache can help significantly, depending on your usage patterns, and the
0.20 release of HBase is heavily focused on random read performance, though
that release is still months away.

I would say it's certainly possible to build a webmail system on top of
HBase, but if you're running on 0.18/0.19 you'll first want to do
performance testing with blockcache, and you will probably still need a
key/val cache like memcached (I'm using Tokyo Cabinet).  Since e-mails are
typically immutable, this kind of cache will go a long way.
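
A minimal sketch of the kind of read-through cache described above, assuming
records are immutable so cached entries never need invalidation.  The
ReadThroughCache class, the fetch callable, and the in-process dict are
illustrative stand-ins only; they are not HBase, memcached, or Tokyo Cabinet
APIs, and a real deployment would swap in the actual client calls.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Cache-aside wrapper for immutable records such as e-mail bodies."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        # Hypothetical slow lookup, standing in for a random read from HBase.
        self._fetch = fetch
        # Plain dict standing in for an external key/value cache.
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        # Serve from the cache when possible.
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # On a miss, pay for the slow read once and remember the result.
        # No expiry is needed under the immutability assumption.
        self.misses += 1
        value = self._fetch(key)
        self._store[key] = value
        return value
```

Only the first read of a given key reaches the backing store; repeat reads
are served from the cache.  If records can change, this sketch would also
need an invalidation or expiry policy.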

JG

> -----Original Message-----
> From: Joost Ouwerkerk [mailto:joost@www.openplaces.com]
> Sent: Wednesday, November 19, 2008 10:58 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Large webmail storage and Hbase
> 
> Edward,
> 
> We're working on a user-facing web system backed by Hbase.  More
> read-oriented than a mail system, but it does also have web users
> writing to it.  We're making heavy use of memcached because HBase
> random read is not fast enough.  Haven't tried BLOCKCACHE yet, but
> reading a random row from HBase generally costs us about 150ms, which
> when multiplied by 10-20 records is expensive.  We think it's this
> slow because of the quantity of data we're transporting, but haven't
> fully figured it out yet -- MySQL and memcached can deliver the same
> quantity of data in 1/10th the time.  If you can model your data to
> favour reading with scanners instead of randomly, I'm sure you could
> do much better.  I know that the scanner code was recently optimized
> with a batching strategy.
> 
> We're using Solr/Lucene for secondary indexes & searching.  We often
> display indexed results instead of retrieving data from the database.
> We generally do only one HBase getRow call per user HTTP request, the
> rest comes from Solr or memcached.
> 
> We haven't rolled out beyond a small alpha user group, so the system
> is not proven in the real world.  Like Stack says: try it and see what
> happens.  And be prepared to switch to an ugly MySQL sharding approach
> if it doesn't work out.
> 
> j
> 
> On Tue, Nov 18, 2008 at 9:21 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> 
> > Does anyone have some opinion about this?
> >
> > On Tue, Nov 18, 2008 at 11:18 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > Hi,
> > >
> > > I'm considering storing large-scale web-mail data on Hbase.
> > > IMO, I expect to be able to solve both real-time and batch (e.g.
> > > spam filtering, from/to graph, etc.) issues. But I'm still not sure
> > > whether it's suitable for storing web mail data. The stable online
> > > real-time service should be possible to be a web mail service.
> > >
> > > Has anyone tried something similar (a real-time application), or
> > > does anyone know about the gmail architecture?
> > > Any advice is welcome, thanks!
> > >
> > > --
> > > Best Regards, Edward J. Yoon @ NHN, corp.
> > > edwardyoon@apache.org
> > > http://blog.udanax.org
> > >
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon @ NHN, corp.
> > edwardyoon@apache.org
> > http://blog.udanax.org
> >

