lucene-java-user mailing list archives

From Schreiner Wolfgang <>
Subject RE: Lucene applicability
Date Tue, 31 Aug 2010 09:17:42 GMT

Thank you all for taking the time to answer my questions!
However, a few issues are still not quite clear to me, and I hope to get advice on
those too:

1.) How is the index maintained? In another product, where we use an indexer other than
Lucene, we have one central index and a few JBoss servers all accessing that same index. So
how does Lucene handle synchronization between multiple threads (in different JVMs)? How does
it keep the index current after update/delete operations on the database?
2.) Is the index always up-to-date? The FAQ says we have to re-open the IndexReader
periodically ... how expensive (in computational terms) is it to do that on every request,
for instance?
3.) I'm still not sure about performance. According to the FAQ, we need to build our own
MultiPhraseQuery parser to support multiple terms and wildcards. For example, consider
50,000,000 documents: 50,000 of them match term T1 in category A, 50,000 match term T2 in
category B, and 1,000,000 match term T3 in category C. Only 50 match T1 in A and T2 in B and
T3 in C. How fast is the algorithm in this case? What guarantees that it doesn't start from
the 1-million side?
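
To make question 3 concrete: the thread does not say how Lucene orders the clauses of a conjunction, so the following is only a general sketch of the technique the question hints at, not Lucene's implementation. It intersects sorted lists of document ids and lets the shortest list (the rarest term) drive the loop. The class name `ConjunctionSketch`, the method `intersect`, and the sample ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ConjunctionSketch {

    /** Returns the doc ids that occur in every one of the sorted posting lists. */
    static List<Integer> intersect(List<int[]> postings) {
        List<Integer> hits = new ArrayList<>();
        if (postings.isEmpty()) {
            return hits;
        }
        // Order the lists by length so the rarest term drives the loop.
        List<int[]> sorted = new ArrayList<>(postings);
        sorted.sort(Comparator.comparingInt(p -> p.length));
        int[] lead = sorted.get(0);
        for (int doc : lead) {
            boolean inAll = true;
            for (int i = 1; i < sorted.size() && inAll; i++) {
                // Binary search stands in for an "advance to this doc id" step.
                inAll = Arrays.binarySearch(sorted.get(i), doc) >= 0;
            }
            if (inAll) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] t1 = {3, 8, 21, 40};                        // hypothetical matches for T1
        int[] t2 = {8, 21, 33};                           // hypothetical matches for T2
        int[] t3 = {1, 2, 3, 5, 8, 13, 21, 34, 40, 55};   // the "large" term T3
        System.out.println(intersect(Arrays.asList(t3, t1, t2))); // prints [8, 21]
    }
}
```

With this ordering, the work is bounded by the size of the smallest list times a lookup in each of the others, so the large T3 list is only probed, never scanned in full.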



-----Original Message-----
From: Lance Norskog [] 
Sent: Thursday, 26 August 2010 05:25
Subject: Re: Lucene applicability

A stepping stone to the above is that, in DB terms, a Lucene index is
only one table. It has a suite of indexing features that are very
different from database search. The features are oriented to searching
large bodies of text for "ideas" rather than concrete words. It
searches a lot faster than a DB. It also spends more time creating its
various indexes than a DB. Another point: you can't add or drop fields
or indexes.

On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
<> wrote:
> The SOLR wiki has lots of good information, start there:
> Otherwise, see below...
> On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
>> wrote:
>> Hi all,
>> We are currently evaluating potential search frameworks (such as Hibernate
>> Search) which might be suitable to use in our project (using Spring, JPA
>> with Hibernate) ...
>> I am sending this E-Mail in hope you can advise me on a few issues that
>> would help us in our decision making process.
>> 1.)    Is Lucene suitable for full text database searches? I read Lucene
>> was designed to index and search documents but how does it behave querying
>> relational data sets in general?
> Let's start by talking about the phrase "full text database searches". One
> thing virtually all db-centric
> people trip over is trying to use SOLR as if it were a database. You just
> can't think about tables. The
> first time you think about using SOLR to do something join-like, stop and
> take a deep breath and
> think about documents instead. The general approach is to flatten your data
> so that each "document"
> contains all the relevant info. Yes, this leads to de-normalization. Yes,
> denormalized data makes a
> good DBA cringe. But that's the difference between searching and using a
> database. "Document" is somewhat misleading. A document in SOLR terms is just a
> collection of fields. And, BTW,
> there's no requirement that each document have the same fields (very unlike
> a DB).
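
As a rough illustration of the flattening described above, here is a minimal sketch in plain Java. It deliberately does not use the Lucene or SOLR API: a "document" is modeled as a map from field name to a list of values, and the class `FlattenSketch`, the method names, and the field names `lastname`, `firstname` and `city` are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FlattenSketch {

    /** Adds one value to a field; adding to the same field again keeps both values. */
    static void add(Map<String, List<String>> doc, String field, String value) {
        doc.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    /**
     * Flattens one person row and its joined rows into a single document.
     * Empty inputs produce no field at all, so documents may differ in shape.
     */
    static Map<String, List<String>> flatten(String lastName,
                                             List<String> firstNames,
                                             List<String> cities) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        add(doc, "lastname", lastName);
        for (String firstName : firstNames) {
            add(doc, "firstname", firstName);
        }
        for (String city : cities) {
            add(doc, "city", city);
        }
        return doc;
    }

    public static void main(String[] args) {
        // prints {lastname=[Smith], firstname=[John, Paul], city=[New York]}
        System.out.println(flatten("Smith", Arrays.asList("John", "Paul"), Arrays.asList("New York")));
        // prints {lastname=[Jones], firstname=[Mary]}
        System.out.println(flatten("Jones", Arrays.asList("Mary"), Collections.<String>emptyList()));
    }
}
```

The second document has no `city` field at all, which is the "no requirement that each document have the same fields" point in miniature.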
>> 2.)    Can we make assumptions on query performance considering combined
>> searches, range queries or structured data and wildcard searches? If we
>> consider a data structure consisting of say 3 tables and each table contains
>> a few million entries (e.g. first name, last name and address fields) and we
>> search for common values (such as 'John', 'Smith' and 'New York') where
>> a.       each value for itself and each combination would result in
>> millions of hits
> Sure, but what those assumptions are is totally dependent on how you've set
> things up. SOLR has been successfully
> used on several billion document indexes. There are tools for making all
> that work (i.e. replication, sharding, etc)
> built into SOLR. So I suspect you can make things work. Several million
> documents is not that large a data set.
> As always, there are tradeoffs between speed and complexity. But from what
> you've described
> I see no show stoppers.
>> b.      a person can have multiple first names and we want to make sure to
>> receive any combination of the last name with any first name
> This just sounds like an OR, but queries can get pretty complex.
> Some examples of what you expect would help.
> See multi-valued fields. So, a "document" can have multiple "firstname"
> entries. Again, not like a DB (your reflexes will trip you
> up on this point <G>).
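
A small sketch of what "a last name with any of several first names" could look like against a multi-valued field. This is plain Java rather than SOLR query syntax, and `MultiValuedSketch`, `anyValueMatches` and `matches` are hypothetical names; the document is again modeled as a map from field name to a list of values.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiValuedSketch {

    /** True if any value stored under the field is one of the wanted values. */
    static boolean anyValueMatches(Map<String, List<String>> doc, String field, List<String> wanted) {
        for (String value : doc.getOrDefault(field, Collections.<String>emptyList())) {
            if (wanted.contains(value)) {
                return true;
            }
        }
        return false;
    }

    /** Models: lastname:X AND (firstname:A OR firstname:B OR ...). */
    static boolean matches(Map<String, List<String>> doc, String lastName, List<String> firstNames) {
        return anyValueMatches(doc, "lastname", Collections.singletonList(lastName))
            && anyValueMatches(doc, "firstname", firstNames);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("lastname", Arrays.asList("Smith"));
        doc.put("firstname", Arrays.asList("John", "Paul")); // two values in one field
        System.out.println(matches(doc, "Smith", Arrays.asList("Paul", "George"))); // prints true
        System.out.println(matches(doc, "Smith", Arrays.asList("George")));         // prints false
    }
}
```

No join is involved: because both first names live in the same document, a single OR over the `firstname` values covers every combination with the last name.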
>> c.       we search for a last name and a range of birth dates
> Sure, range queries work just fine. Note that dates can trip you up; look at
> triedate if you experiment.
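
One way dates can trip you up (this is an assumption about what is meant here, not something the thread spells out) is that when index terms are compared as strings, a range only works if the textual form of the date sorts chronologically. The sketch below is plain Java, not the SOLR triedate type: it keeps birth dates as `yyyyMMdd` strings in a sorted map and answers an inclusive range with `subMap`. The class `DateRangeSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class DateRangeSketch {

    // Term (birth date as yyyyMMdd) -> ids of the documents containing it, in sorted term order.
    private final TreeMap<String, List<Integer>> terms = new TreeMap<>();

    /** Records that the document has the given birth date. */
    void index(int docId, String birthDate) {
        terms.computeIfAbsent(birthDate, k -> new ArrayList<>()).add(docId);
    }

    /** Returns the ids of documents whose date lies in [from, to], both inclusive. */
    List<Integer> range(String from, String to) {
        List<Integer> hits = new ArrayList<>();
        for (List<Integer> ids : terms.subMap(from, true, to, true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        DateRangeSketch idx = new DateRangeSketch();
        idx.index(1, "19700115");
        idx.index(2, "19851231");
        idx.index(3, "19990704");
        System.out.println(idx.range("19800101", "19991231")); // prints [2, 3]
    }
}
```

A format such as `15.01.1970` would break this, because its string order is not its chronological order; the fixed-width year-month-day form is what makes the range meaningful.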
>> 3.)    Transaction safety: How does Lucene handle indexes? If we update
>> data model and index, what happens to the index if anything goes wrong as
>> soon as the data model has been persisted?
> A lot of work has been done to make SOLR quite robust if "anything goes
> wrong". That said, how are you backing up your data?
> That is, what is the source of the data you're going to index? If you're
> relying on your SOLR index to be your backup, you simply must back it up
> somewhere "often enough" to get by if your building burns down. I'd also
> think about storing your original input...
> This is no different than a DB. You have to guard against the disk crashing,
> someone walking by with a powerful magnet, earthquakes, floods, fires
> <G>.....
> Do note that if you modify your index schema, no existing documents reflect
> the new schema; you have to reindex them.
>> I hope I made the issues clear to you, just some general thoughts about how
>> Lucene would behave in a real world application scenario ... Any support or
>> pointers to helpful documents or Web links are highly appreciated!
>> Cheers for now,
>> w

Lance Norskog
