hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject RE: Lucene from HBase - raw values in Lucene index or not?
Date Wed, 17 Dec 2008 15:20:13 GMT
Hey Tim,

I have dabbled with sharding a Solr index.  We applied a consistent
hashing algorithm to our IDs to determine which node each record should
be inserted into.
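
For what it's worth, the shard-picking piece is small.  A minimal sketch
of the idea (the hash mixer and virtual-node count below are stand-ins,
not what we actually run):

  import java.util.SortedMap;
  import java.util.TreeMap;

  // Minimal consistent-hash ring mapping record IDs to shard names.
  // Virtual nodes smooth the distribution as shards come and go.
  public class ShardRing {
      private static final int VIRTUAL_NODES = 100;
      private final SortedMap<Integer, String> ring =
          new TreeMap<Integer, String>();

      public void addShard(String shard) {
          for (int i = 0; i < VIRTUAL_NODES; i++) {
              ring.put(hash(shard + "#" + i), shard);
          }
      }

      public String shardFor(String id) {
          // First ring position at or after the ID's hash, wrapping around.
          SortedMap<Integer, String> tail = ring.tailMap(hash(id));
          return tail.isEmpty() ? ring.get(ring.firstKey())
                                : tail.get(tail.firstKey());
      }

      private static int hash(String s) {
          // Stand-in mixer; production code would use something like MD5.
          int h = s.hashCode();
          h ^= (h >>> 20) ^ (h >>> 12);
          return h ^ (h >>> 7) ^ (h >>> 4);
      }
  }

The nice property versus plain mod-N hashing is that adding or removing a
shard only remaps the keys that hashed to that shard's ring positions.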

One downside, and I'm not sure if this exists with Katta, is that you don't
get good relevancy across indexes.  Distributed querying is really just
querying each shard independently, and the relevancy ranking is only
meaningful within each individual index; there is no global rank.  One idea
is to shard on some other parameter, which might allow you to apply
relative-relevancy ;) given any domain-specific information.
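
To make that concrete: a naive distributed query merges per-shard hits by
raw score, roughly like the sketch below (Hit is a stand-in for whatever
your shards return).  Because each score comes from shard-local term
statistics, the merged order is not a true global ranking:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  public class NaiveMerge {
      static class Hit {
          final String id;
          final float score;  // computed from shard-LOCAL IDF statistics
          Hit(String id, float score) { this.id = id; this.score = score; }
      }

      // Interleave per-shard results by raw score.  A mediocre hit from a
      // shard with skewed term statistics can outrank a better hit from
      // another shard, since the scores were never normalized globally.
      static List<Hit> merge(List<List<Hit>> perShard, int n) {
          List<Hit> all = new ArrayList<Hit>();
          for (List<Hit> hits : perShard) {
              all.addAll(hits);
          }
          Collections.sort(all, new Comparator<Hit>() {
              public int compare(Hit a, Hit b) {
                  return Float.compare(b.score, a.score);
              }
          });
          return all.subList(0, Math.min(n, all.size()));
      }
  }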

I'm very interested in your problem.  Right now our indexes are small
enough that we can get by with 1 or 2 well-equipped nodes, but soon enough
we will outgrow that and be looking at sharding across 5-10 nodes.  Our
results are usually "page" size (10-20), so we don't have the same issue of
how to fetch them efficiently.

In these cases where you might be looking for 10M records, what percentage
of the total dataset is that?  100M, 1B?  If those 10M are only a small
slice of the total, a full scan is going to seriously limit how fast you
can go, even with good caching and faster IO.  But if you're returning a
significant fraction of the total rows, then a scan would definitely make
sense.

If your data is relatively static, you might look at putting a very simple
disk-based key/val cache in front of HBase, using something like Berkeley
DB or my favorite, Tokyo Cabinet.  These can handle huge numbers of
records, stored on disk but accessible in sub-ms time.  I have C and Java
code for working with Tokyo and HBase together.  With such a high number of
records, it's probably not feasible to keep them all in memory, so a
solution like this could be your best bet; see the sketch below.

Also, stay tuned to this issue, which would give you something close to a
disk-based key/val store by using Direct IO (preliminary testing shows a
10X random-read improvement):  https://issues.apache.org/jira/browse/HADOOP-4801
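
Here is that read-through cache sketch, written from memory of the
tokyocabinet Java binding, so treat the calls as approximate; loadFromHBase
is a placeholder for your actual fetch, and error handling is stripped down:

  import tokyocabinet.HDB;

  // Disk-backed read-through cache in front of HBase, using a Tokyo
  // Cabinet hash database.  Records live on disk but warm lookups
  // come back in sub-ms time.
  public class TokyoCache {
      private final HDB hdb = new HDB();

      public TokyoCache(String path) {
          if (!hdb.open(path, HDB.OWRITER | HDB.OCREAT)) {
              throw new RuntimeException("tc open failed, code " + hdb.ecode());
          }
      }

      public String get(String rowKey) {
          String value = hdb.get(rowKey);
          if (value == null) {
              value = loadFromHBase(rowKey);  // placeholder: your HBase fetch
              if (value != null) {
                  hdb.put(rowKey, value);
              }
          }
          return value;
      }

      public void close() {
          hdb.close();
      }

      private String loadFromHBase(String rowKey) {
          return null;  // wire this up to HTable.getRow(...) or similar
      }
  }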

And this is an old issue that will have new life soon:
https://issues.apache.org/jira/browse/HBASE-80

Like I said, I have an interest in seeing how to solve this problem, so let
me know if you have any other questions or if we can help in any way.

Jonathan Gray

> -----Original Message-----
> From: tim robertson [mailto:timrobertson100@gmail.com]
> Sent: Tuesday, December 16, 2008 11:42 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
> 
> Hi,
> 
> Thanks for the help.
> 
> My Lucene indexes are for sure going to be too large for one machine, so
> I plan to put the indexes on HDFS and then let Katta distribute them
> around a few machines.  Because of Katta's ability to do this, I went for
> Lucene and not SOLR, which would require me to do all the sharding
> myself, if I understand distributed SOLR correctly - I would much prefer
> SOLR's handling of primitive types, as right now I convert all dates and
> ints manually.  If someone has distributed SOLR where it really is too
> big for one machine (indexes >50G), I'd love to hear how they sharded
> nicely and manage it.
> 
> Regarding performance... well, for "reports" that will return 10M
> records, I will be quite happy with minutes as a response time, as this
> is typically data download for scientific analysis, and therefore people
> are happy to wait.  The results get put onto Amazon S3, GZipped, for
> download.  What worries me is that if I have 10-100 reports running at
> one time, that is an awful lot of single-record requests on HBase.  I
> guess I will try and blog the findings.
> 
> I am following the HBase, Katta and Hadoop trunks, so I will also try to
> always use the latest, as this is a research project and not production
> right now (production is still MySQL-based).
> 
> The alternative of course is to always open a scanner and then do a
> full table scan for each report...
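> 
> Roughly this shape, I mean (a sketch from memory against the 0.18-era
> client API, so the exact signatures may be off; the table and column
> names are placeholders):
> 
>   HBaseConfiguration conf = new HBaseConfiguration();
>   HTable table = new HTable(conf, "records");
>   // scan the whole "data:" family instead of issuing point gets
>   Scanner scanner =
>       table.getScanner(new byte[][] { Bytes.toBytes("data:") });
>   try {
>     for (RowResult row : scanner) {
>       Cell cell = row.get(Bytes.toBytes("data:payload"));
>       if (cell != null) {
>         // placeholder: append to the report being built for S3
>         writeToReport(row.getRow(), cell.getValue());
>       }
>     }
>   } finally {
>     scanner.close();
>   }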
> 
> Thanks
> 
> Tim
> 
> On Wed, Dec 17, 2008 at 12:22 AM, Jonathan Gray <jlist@streamy.com>
> wrote:
> > If I understand your system (and Lucene) correctly, you obviously must
> > input all queried fields to Lucene.  And the indexes will be stored for
> > the documents.
> >
> > Your question is whether to also store the raw fields in Lucene, or
> > just store the indexes in Lucene?
> >
> > A few things you might consider...
> >
> > - Scaling Lucene is much more difficult than scaling HBase.  Storing
> > indexes and raw content is going to grow your Lucene instance fast.
> > Scaling HBase is easy, and you're going to have constant performance,
> > whereas Lucene performance will degrade significantly as it grows.
> >
> > - Random access to HBase currently leaves something to be desired.
> > What kind of performance are you looking for with 1M random fetches?
> > There is major work being done for 0.19 and 0.20 that will really help
> > with performance, as stack mentioned.
> >
> > - With 1M random reads, you might never get the performance out of
> > HBase that you want, certainly not if you're expecting 1M fetches to be
> > done in "realtime" (~100ms or so).  However, depending on your dataset
> > and access patterns, you might be able to get sufficient performance
> > with caching (either the block caching that is currently available, or
> > the record caching slated for 0.20 but likely available as a patch
> > soon).
> >
> > We are using Lucene by way of Solr and are not storing the raw data in
> > Lucene.  We have an external Memcached-like cache so that our raw
> > content fetches are sufficiently quick.  My team is currently working
> > on building this cache into HBase.
> >
> > I'm not sure if the highlighting features in Solr are only part of
> > Solr or also in Lucene, but of course you lose the ability to do those
> > things if you don't put the raw content into Lucene.
> >
> > JG
> >
> >
> >
> >> -----Original Message-----
> >> From: stack [mailto:stack@duboce.net]
> >> Sent: Tuesday, December 16, 2008 2:37 PM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
> >>
> >> Interesting question.
> >>
> >> Would be grand if you didn't have to duplicate the hbase data in the
> >> lucene index -- just store the hbase locations, or store the small
> >> stuff in the lucene index and leave the big stuff back in hbase -- but
> >> perhaps the double hop of lucene first and then to hbase will not
> >> perform well enough?  0.19.0 hbase will be better than 0.18.0, if you
> >> can wait a week or so for the release candidate to test.
> >>
> >> Let us know how it goes Tim,
> >> St.Ack
> >>
> >>
> >> tim robertson wrote:
> >> > Hi All,
> >> >
> >> > I have HBase running now, building Lucene indexes on Hadoop
> >> > successfully and then I will get Katta running for distributing my
> >> > indexes.
> >> >
> >> > I have around 15 search fields indexed that I wish to extract and
> >> > return to the user in the result set - my result sets will be up to
> >> > millions of records...
> >> >
> >> > Should I:
> >> >
> >> >   a) store the values in the Lucene index, which makes it slower to
> >> > search but returns the results immediately, in pages, without
> >> > hitting HBase
> >> >
> >> > or
> >> >
> >> >   b) not store the data in the index, but page over the Lucene index
> >> > and do millions of "get by ROWKEY" calls on HBase
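> >> >
> >> > (Concretely, with the Lucene 2.x-era Field API - given a Document
> >> > doc, and with placeholder field names - the two options would look
> >> > something like:)
> >> >
> >> >   // a) index AND store: bigger index, no HBase hit at read time
> >> >   doc.add(new Field("genus", genus,
> >> >       Field.Store.YES, Field.Index.TOKENIZED));
> >> >
> >> >   // b) index only, plus a stored row key for follow-up HBase gets
> >> >   doc.add(new Field("genus", genus,
> >> >       Field.Store.NO, Field.Index.TOKENIZED));
> >> >   doc.add(new Field("rowkey", rowKey,
> >> >       Field.Store.YES, Field.Index.UN_TOKENIZED));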
> >> >
> >> > Obviously this is not happening synchronously while the user waits,
> >> > but I look forward to hearing whether people have done similar
> >> > scenarios and what worked out nicely...
> >> >
> >> > Lucene degrades in performance at large page numbers (100th page of
> >> > 1000 results), right?
> >> >
> >> > Thanks for any insights,
> >> >
> >> > Tim
> >> >
> >
> >

