hbase-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Lucene from HBase - raw values in Lucene index or not?
Date Wed, 17 Dec 2008 07:41:59 GMT

Thanks for the help.

My Lucene indexes are certainly going to be too large for one machine,
so I plan to put the indexes on HDFS and then let Katta distribute
them across a few machines.  Because of Katta's ability to do this, I
went with Lucene rather than SOLR, which - if I understand distributed
SOLR correctly - requires me to do all the sharding myself.  I would
much prefer SOLR's handling of primitive types, as right now I convert
all dates and ints manually.  If someone has distributed SOLR (where
it really is too big for one machine, since the indexes are >50G), I'd
love to hear how they sharded nicely and manage it.
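For what it's worth, the manual sharding I'm trying to avoid boils down to something like the following - a minimal sketch in Python, not SOLR's or Katta's actual API; the shard count and the MD5-based routing are my own assumptions:

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Assign a document to a shard by hashing its ID.

    Stable hashing keeps a given document on the same shard
    across runs, which matters for later updates and deletes.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route documents to shards before indexing each shard separately.
shards = {i: [] for i in range(4)}
for doc_id in ["occ-1", "occ-2", "occ-3", "occ-4", "occ-5"]:
    shards[shard_for(doc_id, 4)].append(doc_id)
```

Queries then have to fan out to every shard and merge the results, which is exactly the bookkeeping I'd rather have the framework do.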

Regarding performance... well, for "reports" that will return 10M
records, I will be quite happy with minutes as a response time, as
this is typically data download for scientific analysis, so people
are happy to wait.  The results are put on Amazon S3, GZipped, for
download.  What worries me is that if I have 10-100 reports running at
one time, that is an awful lot of single-record requests on HBase.  I
guess I will try to blog the findings.
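One mitigation I'm considering for that flood of single-record requests is batching the rowkey lookups, so each round trip fetches many rows at once.  A rough sketch of just the batching logic - the fetch function here is a stand-in I made up, not the real HBase client API:

```python
from typing import Callable, Iterable, List

def fetch_in_batches(rowkeys: Iterable[str],
                     fetch_batch: Callable[[List[str]], List[dict]],
                     batch_size: int = 1000) -> List[dict]:
    """Group rowkeys into fixed-size batches and fetch each batch
    with a single call, instead of one round trip per row."""
    keys = list(rowkeys)
    results: List[dict] = []
    for start in range(0, len(keys), batch_size):
        results.extend(fetch_batch(keys[start:start + batch_size]))
    return results

# Stand-in for a multi-get against the store: just echoes the keys.
def fake_fetch(batch: List[str]) -> List[dict]:
    return [{"row": k} for k in batch]
```

Even if each batch still resolves to random reads server-side, cutting the number of client round trips should help when 10-100 reports are in flight.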

I am following the HBase, Katta and Hadoop trunks, so I will also try
to always use the latest, as this is a research project and not
production right now (production is still MySQL-based).

The alternative, of course, is to always open a scanner and do a
full table scan for each report...
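The scan-versus-gets decision is really just arithmetic: a full scan costs roughly (total table rows x per-row scan cost), while the alternative costs (matching rows x per-get cost).  A toy estimator - the default cost ratios below are illustrative assumptions, not measurements of HBase:

```python
def prefer_scan(table_rows: int, hit_rows: int,
                scan_cost_per_row: float = 0.01,
                get_cost_per_row: float = 1.0) -> bool:
    """Return True when a full table scan is estimated to be cheaper
    than issuing one random get per matching row.

    Sequential reads are assumed far cheaper per row than random
    gets, so scans win once the report touches a large enough
    fraction of the table.
    """
    scan_cost = table_rows * scan_cost_per_row
    gets_cost = hit_rows * get_cost_per_row
    return scan_cost < gets_cost
```

With these made-up ratios, a 10M-row report over a 100M-row table favours the scan, while a 1,000-row report favours the random gets; the real crossover point is something I'd have to benchmark.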



On Wed, Dec 17, 2008 at 12:22 AM, Jonathan Gray <jlist@streamy.com> wrote:
> If I understand your system (and Lucene) correctly, you obviously must input
> all queried fields to Lucene.  And the indexes will be stored for the
> documents.
> Your question is about whether to also store the raw fields in Lucene or
> just store indexes in Lucene?
> A few things you might consider...
> - Scaling Lucene is much more difficult than scaling HBase.  Storing indexes
> and raw content is going to grow your Lucene instance fast.  Scaling HBase
> is easy and you're going to have constant performance whereas Lucene
> performance will degrade significantly as it grows.
> - Random access to HBase currently leaves something to be desired.  What
> kind of performance are you looking for with 1M random fetches?  There is
> major work being done for 0.19 and 0.20 that will really help with
> performance as stack mentioned.
> - With 1M random reads, you might never get the performance out of HBase
> that you want, certainly not if you're expecting 1M fetches to be done in
> "realtime" (~100ms or so). However, depending on your dataset and access
> patterns, you might be able to get sufficient performance with caching
> (either block that is currently available, or record caching slated for 0.20
> but likely with a patch available soon).
> We are using Lucene by way of Solr and are not storing the raw data in
> Lucene.  We have an external Memcached-like cache so that our raw content
> fetches are sufficiently quick.  My team is currently working on building
> this cache into HBase.
> I'm not sure if the highlighting features in Solr are only part of Solr or
> also in Lucene, but of course you lose the ability to do those things if you
> don't put the raw content into Lucene.
> JG
>> -----Original Message-----
>> From: stack [mailto:stack@duboce.net]
>> Sent: Tuesday, December 16, 2008 2:37 PM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>> Interesting question.
>> Would be grand if you didn't have to duplicate the hbase data in the
>> lucene index, just store the hbase locations -- or, just store small
>> stuff in the lucene index and leave big-stuff back in hbase -- but
>> perhaps the double hop of lucene first and then to hbase will not
>> perform well enough?  0.19.0 hbase will be better than 0.18.0 if you
>> can
>> wait a week or so for the release candidate to test.
>> Let us know how it goes Tim,
>> St.Ack
>> tim robertson wrote:
>> > Hi All,
>> >
>> > I have HBase running now, building Lucene indexes on Hadoop
>> > successfully and then I will get Katta running for distributing my
>> > indexes.
>> >
>> > I have around 15 search fields indexed that I wish to extract and
>> > return those 15 to the user in the result set - my result sets will
>> > be
>> > up to millions of records...
>> >
>> > Should I:
>> >
>> >   a) have the values stored in the Lucene index which will make it
>> > slower to search but returns the results immediately in pages without
>> > hitting HBase
>> >
>> > or
>> >
>> >   b) Not store the data in the index but page over the Lucene index
>> > and do millions of "get by ROWKEY" on HBase
>> >
>> > Obviously this is not happening synchronously while the user waits,
>> > but looking forward to hear if people have done similar scenarios and
>> > what worked out nicely...
>> >
>> > Lucene degrades in performance at large page numbers (100th page of
>> > 1000 results) right?
>> >
>> > Thanks for any insights,
>> >
>> > Tim
>> >
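On the deep-paging question at the end of the quoted thread: a top-N collector has to track the best page x pagesize hits before it can serve a given page, so the work grows with page depth even though each page is the same size.  A small pure-Python simulation of that collector behaviour (not Lucene's actual collector code):

```python
import heapq
import random

def top_n(scores, n):
    """Keep the n best scores seen so far, as a priority-queue
    collector does; serving page p of size s means n = p * s."""
    heap = []
    for s in scores:
        if len(heap) < n:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)  # evict the current worst
    return sorted(heap, reverse=True)

random.seed(0)
scores = [random.random() for _ in range(200_000)]
# Page 1 at 1000 results/page tracks 1,000 hits; page 100 must
# track 100,000 - the heap and the final sort scale with depth.
page_1 = top_n(scores, 1000)
page_100_window = top_n(scores, 100 * 1000)
```

So yes, page 100 of 1000 results is materially more expensive than page 1, which is another argument for streaming the whole result set out to S3 in one pass rather than paging interactively.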
