hbase-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Lucene from HBase - raw values in Lucene index or not?
Date Wed, 17 Dec 2008 16:10:39 GMT
Hey Jonathan

Firstly, thank you for your considered response and offer for help...
this is much appreciated.

So, we have a MySQL solution now, on a big fat server (64G RAM), with
150M or so records in the biggest table.  It acts as an index of
several thousand resources that are harvested from around the world
(all biodiversity data and all this is open source / 'open access with
citation').  Because it is an index, and because of the nature of the
data (150M specimens, but tied to hierarchical trees - "taxonomy"), it
does not partition nicely (geospatially, taxonomically, temporally),
since all partitions would need to be hit for one query or another.
As we have grown we have had to limit the fields we offer as search
fields.  We expect and aim to grow to billions in the coming months.
This is probably a familiar scenario, huh?
I am working alone and in my 'spare time' to try and produce a
prototype HDFS-based version as I described, and learning as fast as I
can - I must admit that Hadoop, HBase and Lucene are all new to me in
recent months.  I blog things (I am not an eloquent blogger as it is
normally late on Sunday night) here: http://biodivertido.blogspot.com/

To specifically answer your questions:
 - ranking is largely irrelevant for us, since all records are considered equal
   - I am not entirely sure what "relevancy ranking" is, and assume it
is the search ranking?

 - regarding the percentage - well this really depends:
   - most searches will be small results - so index makes sense
   - also, I look to generate maps quickly so might need latitude and
longitude in indexes
   - we need to offer slicing of data
     - we are an international network, and countries / thematic
networks want data dumps for their own analysis
     - therefore USA is 50%+ of the data, but a specialist butterfly
site is small... so perhaps use an index to do a count and then
pick a slice strategy?
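
A minimal sketch of the "count first, then pick a slice strategy" idea
above. The function name, the strategy labels and the 20% threshold are
all illustrative assumptions, not measured values; the real cut-over
point would have to come from benchmarking index lookups against a full
scan.

```python
def pick_slice_strategy(match_count: int, total_records: int,
                        scan_threshold: float = 0.2) -> str:
    """Choose how to extract a slice, given a cheap count from the index.

    scan_threshold is a hypothetical tuning knob: the fraction of the
    table above which a single full scan is assumed to be cheaper than
    one random fetch per matching record.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if not 0 <= match_count <= total_records:
        raise ValueError("match_count must be between 0 and total_records")

    fraction = match_count / total_records
    # Large slices (e.g. a country holding half the data): scan everything
    # once and filter. Small slices (e.g. a specialist site): use the index
    # and fetch only the matching records.
    return "full_scan" if fraction >= scan_threshold else "index_lookup"


# A slice holding half of 150M records versus a 50k-record specialist slice.
print(pick_slice_strategy(75_000_000, 150_000_000))
print(pick_slice_strategy(50_000, 150_000_000))
```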

So you have now added Tokyo to the long list of things I need to learn
- but I will look into it...

My work so far is in my spare time, on EC2 and S3, and I am working
alone - I am not sure how interested you are in getting involved?  I
would greatly appreciate a knowledgeable pair of eyes to sanity check
things I am doing (e.g. scanning HBase, parsing XML to raw fields,
then scanning again and parsing to interpreted values etc), and would
welcome anyone who is interested in contributing.  It might take some
set-up time though, as I have all the pieces working in isolation and
don't keep EC2 running all the time.

Thanks again, I'll continue to post progress


On Wed, Dec 17, 2008 at 4:20 PM, Jonathan Gray <jlist@streamy.com> wrote:
> Hey Tim,
> I have dabbled with sharding of a Solr index.  We applied a consistent
> hashing algorithm to our IDs to determine which node to insert to.
> One downside, not sure if this exists with Katta, is that you don't have
> good relevancy across indexes.  For example, distributed querying is really
> just querying each shard.  Unfortunately the relevancy ranking is only
> relevant within each individual index, there is no global rank.  One idea is
> to shard based on another parameter which might allow you to apply
> relative-relevancy ;) given any domain-specific information.
> I'm very interested in your problem, right now our indexes are small enough
> that we will be able to get by with 1 or 2 well-equipped nodes, but soon
> enough we will outgrow that and be looking at sharding across 5-10 nodes.
> Our results are usually "page" size (10-20) so we don't have the same issue
> with how to efficiently fetch them.
> In these cases where you might be looking for 10M records, what percentage
> of the total dataset is that?  100M, 1B?  If you turn to a full scan as your
> solution, you're going to seriously limit how fast you can go even with good
> caching and faster IO.  But if you're returning a significant number of
> total rows, then this would definitely make sense.
> If your data is relatively static, you might look at writing a very simple
> disk-based key/val cache like Berkeley DB or my favorite Tokyo Cabinet.
> These can handle high numbers of records, stored on disk, but accessible in
> sub-ms time.  I have C and Java code to work with Tokyo and HBase together.
> With such a high number of records, it's probably not feasible to keep them
> in memory, so a solution like this could be your best bet.  Also, stay tuned
> to this issue as it would create a situation similar to running a disk-based
> key/val by using Direct IO (preliminary testing shows 10X random-read
> improvement):  https://issues.apache.org/jira/browse/HADOOP-4801
> And this is an old issue that will have new life soon:
> https://issues.apache.org/jira/browse/HBASE-80
> Like I said, I have an interest in seeing how to solve this problem, so let
> me know if you have any other questions or if we can help in any way.
> Jonathan Gray
>> -----Original Message-----
>> From: tim robertson [mailto:timrobertson100@gmail.com]
>> Sent: Tuesday, December 16, 2008 11:42 PM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>> Hi,
>> Thanks for the help.
>> My Lucene indexes are for sure going to be too large for one machine,
>> so I plan to put the indexes on the HDFS, and then let Katta
>> distribute them around a few machines.  Because of Katta's ability to
>> do this, I went for Lucene and not SOLR, which requires me to do all
>> the sharding myself, if I understand distributed SOLR correctly - I
>> would much prefer SOLR's primitive handling as right now I convert all
>> dates and Ints manually.  If someone has distributed SOLR (really is
>> too big for one machine since indexes are >50G) I'd love to hear how
>> they sharded nicely and manage it.
>> Regarding performance... well, for "reports" that will return 10M
>> records, I will be quite happy with minutes as a response time, as
>> this is typically data download for scientific analysis, and therefore
>> people are happy to wait.  The results get put on to Amazon S3 GZipped
>> for download.  What worries me is if I have 10-100 reports running at
>> one time, there is an awful lot of single record requests on HBase.  I
>> guess I will try and blog the findings.
>> I am following HBase, Katta and Hadoop code trunks so will also try
>> and always use the latest, as this is a research project and not
>> production right now (production is still mysql based).
>> The alternative of course is to always open a scanner and then do a
>> full table scan for each report...
>> Thanks
>> Tim
>> On Wed, Dec 17, 2008 at 12:22 AM, Jonathan Gray <jlist@streamy.com>
>> wrote:
>> > If I understand your system (and Lucene) correctly, you obviously must
>> > input all queried fields to Lucene.  And the indexes will be stored for
>> > the documents.
>> >
>> > Your question is about whether to also store the raw fields in Lucene or
>> > just store indexes in Lucene?
>> >
>> > A few things you might consider...
>> >
>> > - Scaling Lucene is much more difficult than scaling HBase.  Storing
>> > indexes and raw content is going to grow your Lucene instance fast.
>> > Scaling HBase is easy and you're going to have constant performance
>> > whereas Lucene performance will degrade significantly as it grows.
>> >
>> > - Random access to HBase currently leaves something to be desired.  What
>> > kind of performance are you looking for with 1M random fetches?  There is
>> > major work being done for 0.19 and 0.20 that will really help with
>> > performance as stack mentioned.
>> >
>> > - With 1M random reads, you might never get the performance out of HBase
>> > that you want, certainly not if you're expecting 1M fetches to be done in
>> > "realtime" (~100ms or so). However, depending on your dataset and access
>> > patterns, you might be able to get sufficient performance with caching
>> > (either block caching, which is currently available, or record caching,
>> > slated for 0.20 but likely with a patch available soon).
>> >
>> > We are using Lucene by way of Solr and are not storing the raw data in
>> > Lucene.  We have an external Memcached-like cache so that our raw content
>> > fetches are sufficiently quick.  My team is currently working on building
>> > this cache into HBase.
>> >
>> > I'm not sure if the highlighting features in Solr are only part of Solr or
>> > also in Lucene, but of course you lose the ability to do those things if
>> > you don't put the raw content into Lucene.
>> >
>> > JG
>> >
>> >
>> >
>> >> -----Original Message-----
>> >> From: stack [mailto:stack@duboce.net]
>> >> Sent: Tuesday, December 16, 2008 2:37 PM
>> >> To: hbase-user@hadoop.apache.org
>> >> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>> >>
>> >> Interesting question.
>> >>
>> >> Would be grand if you didn't have to duplicate the hbase data in the
>> >> lucene index, just store the hbase locations -- or, just store small
>> >> stuff in the lucene index and leave big-stuff back in hbase -- but
>> >> perhaps the double hop of lucene first and then to hbase will not
>> >> perform well enough?  0.19.0 hbase will be better than 0.18.0 if you
>> >> can wait a week or so for the release candidate to test.
>> >>
>> >> Let us know how it goes Tim,
>> >> St.Ack
>> >>
>> >>
>> >> tim robertson wrote:
>> >> > Hi All,
>> >> >
>> >> > I have HBase running now, building Lucene indexes on Hadoop
>> >> > successfully and then I will get Katta running for distributing my
>> >> > indexes.
>> >> >
>> >> > I have around 15 search fields indexed that I wish to extract and
>> >> > return those 15 to the user in the result set - my result sets will be
>> >> > up to millions of records...
>> >> >
>> >> > Should I:
>> >> >
>> >> >   a) have the values stored in the Lucene index which will make it
>> >> > slower to search but returns the results immediately in pages without
>> >> > hitting HBase
>> >> >
>> >> > or
>> >> >
>> >> >   b) Not store the data in the index but page over the Lucene index
>> >> > and do millions of "get by ROWKEY" on HBase
>> >> >
>> >> > Obviously this is not happening synchronously while the user waits,
>> >> > but looking forward to hear if people have done similar scenarios and
>> >> > what worked out nicely...
>> >> >
>> >> > Lucene degrades in performance at large page numbers (100th page of
>> >> > 1000 results) right?
>> >> >
>> >> > Thanks for any insights,
>> >> >
>> >> > Tim
>> >> >
>> >
>> >
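
For reference, the consistent-hashing shard assignment mentioned in the
quoted reply (hashing record IDs to decide which node to insert to) can
be sketched roughly as below. This is only an illustration under stated
assumptions, not the code used in that Solr setup: the class name, the
shard names, the use of MD5 and the number of virtual points per node
are all hypothetical choices.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual points to even out the load."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, record_id: str) -> str:
        # Walk clockwise to the first point after the record's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, _hash(record_id)) % len(self._points)
        return self._ring[idx][1]


# Hypothetical shard names and record IDs, for illustration only.
ids = [f"record-{i}" for i in range(1000)]
ring3 = HashRing(["shard-0", "shard-1", "shard-2"])
ring4 = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])

# Adding a shard should relocate only a minority of records,
# and every relocated record should land on the new shard.
moved = [r for r in ids if ring3.node_for(r) != ring4.node_for(r)]
print(len(moved), "of", len(ids), "records move when a fourth shard is added")
```

The point of the ring, compared with a plain `hash(id) % n`, is that
growing from 3 to 4 shards leaves most records where they already are,
so only the moved records need re-indexing.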
