lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: O/S Search Comparisons
Date Sat, 08 Dec 2007 09:51:48 GMT
>> Sometimes, when something like this comes up, it gives you the
>> opportunity to take a step back and ask what are the things we
>> really want Lucene to be going forward (the New Year is good for
>> this kind of assessment as well).  What are its strengths and
>> weaknesses?  What can we improve in the short term and what needs
>> to improve in the longer term?  Maybe it's just that time of year
>> to send out your Lucene Wish List... :-)

+1

There is still something for us to learn & improve in Lucene, even if  
the comparison is necessarily apples/oranges or unfair.

Lucene was listed as not having "Result Excerpt", which isn't really
fair, though it is true you have to pull in contrib/highlighter to
enable it.

> Did it crash on the 10 GB? I thought it said that it just took way
> too long (7 times the best or something). Frankly, either case is
> suspect. Last summer I indexed about 5 million docs with a total
> size at the *very* least of 10 GB on my 3 year old desktop. It
> didn't take much more than 8 hours to index and searches were
> still lightning fast. Maybe they forgot to give the JVM more than
> the default amount of RAM <g>

The paper just said "ht://Dig and Lucene degraded considerably their  
indexing time, and we excluded them from the final comparison".

Maybe Lucene just hit a very large segment merge and the author
incorrectly thought something had gone wrong, since the addDocument
call was taking an incredibly long time?  In which case the new
default ConcurrentMergeScheduler should improve that.  I would expect
Lucene 2.3 to now have an advantage in that it makes use of
concurrency in the hardware, out of the box, whereas other, older
engines are likely single threaded.
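
Just to make that concrete, here is a minimal sketch of setting the
scheduler explicitly (it is already the default in 2.3; the index
path is made up):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.ConcurrentMergeScheduler;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class ConcurrentMergeDemo {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/path/to/index"),  // made-up path
          new StandardAnalyzer(),
          true);  // true = create a new index
      // Merges now run in background threads instead of blocking
      // the thread that calls addDocument().
      writer.setMergeScheduler(new ConcurrentMergeScheduler());
      // ... addDocument() calls go here ...
      writer.close();
    }
  }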

I've also thought about creating a simple optional threaded layer on
top of IndexWriter which uses multiple threads to add documents,
under the hood.  Such a class would expose all of the methods of
IndexWriter (it would feel just like IndexWriter), except calls to
add/updateDocument would drop into a queue which multiple threads
(maintained by this class) would pull from and execute.  This would
then let Lucene make use of even more concurrency ... and save
application writers the "complexity" of having to manage threads
above Lucene.
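
Roughly what I am imagining (just a sketch; the class name is
hypothetical, only addDocument/close are shown, and real code would
need proper error reporting):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  public class ThreadedIndexWriter {
    private final IndexWriter writer;
    private final ExecutorService pool;  // workers draining the queue

    public ThreadedIndexWriter(IndexWriter writer, int numThreads) {
      this.writer = writer;
      this.pool = Executors.newFixedThreadPool(numThreads);
    }

    // Looks like IndexWriter.addDocument(), but just enqueues the doc;
    // one of the worker threads does the actual (thread-safe) add.
    public void addDocument(final Document doc) {
      pool.execute(new Runnable() {
        public void run() {
          try {
            writer.addDocument(doc);
          } catch (Exception e) {
            throw new RuntimeException(e);  // real code: report properly
          }
        }
      });
    }

    public void close() throws Exception {
      pool.shutdown();  // finish all queued adds...
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
      writer.close();   // ...then close the underlying writer
    }
  }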

It is also possible the collection size was such that the merge cost
was very high (too high), because LogMergePolicy inadvertently
optimizes every so often.  I.e., for certain "unlucky" ranges of
collection sizes (number of documents "just above" maxBufferedDocs *
powers-of-mergeFactor, in log-space) you will indeed see that the
amortized merge cost is far too high.  For example, with
maxBufferedDocs=10 and mergeFactor=10, an index of just over 10,000
documents has just cascaded its merges all the way up into a single
10,000-document segment (effectively an optimize), so the amortized
merge cost per document peaks right there.  This is because
LogMergePolicy is "pay it forward": it pays up front for continuing
growth of the index, vs. paying as-you-go, which would be better.  I
opened LUCENE-854 for this issue a while back, but it's still open.
E.g., KinoSearch's merging doesn't "inadvertently optimize", I think.

>> a) missing something in our defaults setup

I do think we've improved the "out of the box" defaults in 2.3, not
only with the speedups to indexing in LUCENE-843, but also by
changing the default to flushing at 16 MB instead of every 10
documents.  This ought to be a sizable improvement for users who just
rely on Lucene's defaults (which is presumably the vast majority of
users).
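
Continuing the IndexWriter sketch above, the 2.3 behavior set
explicitly would look like this:

  // Flush by RAM usage (2.3's new default of 16 MB) rather than by
  // a fixed count of buffered documents.
  writer.setRAMBufferSizeMB(16.0);
  // The old default, for comparison, flushed every 10 buffered docs:
  // writer.setMaxBufferedDocs(10);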

> - Mark
>
> Grant Ingersoll wrote:
>> All true and good points.  Lucene held up quite nicely in the
>> search aspect (at least perf. wise) and I generally don't think
>> making these kinds of comparisons is all that useful (we call it
>> apples and oranges in English :-)  ).
>>
>> What I am trying to get at is if this paper was just about Lucene  
>> and never mentioned a single other system, what, if anything, can  
>> we take from it that can help us make Lucene better.   I know, for  
>> instance, from my own personal experience, that 2.3 is somewhere  
>> in the range of 3-5+ times faster than 2.2 (which I know is faster  
>> than 1.9).  That being said, the paper clearly states that Lucene  
>> was not capable of doing the WT10g docs because performance  
>> degraded too much.  Now, I know Lucene is pretty darn capable of a  
>> lot of things and people are using it to do web search, etc. at  
>> very large scales (I have personally talked w/ people doing it).   
>> So, what I worry about is that either we are:
>> a) missing something in our defaults setup
>> b) missing something in our docs and our education efforts, or
>> c) we are missing some capability in our indexing such that it is  
>> crashing
>>
>> Now, what is to be done?  It may well be nothing, but I just want  
>> to make sure we are comfortable with that decision or whether it  
>> is worth asking for a volunteer who has access to the WT10g docs  
>> to go have a look at it and see what happens.  I personally don't  
>> have access to these docs, otherwise I would try it out.  What we  
>> don't want to happen is for potential supporters/contributors to  
>> read that paper and say "Lucene isn't for me because of this."
>>
>> Sometimes, when something like this comes up, it gives you the
>> opportunity to take a step back and ask what are the things we
>> really want Lucene to be going forward (the New Year is good for
>> this kind of assessment as well).  What are its strengths and
>> weaknesses?  What can we improve in the short term and what needs
>> to improve in the longer term?  Maybe it's just that time of year
>> to send out your Lucene Wish List... :-)
>>
>> Cheers,
>> Grant
>>
>> PS:  Samir, any chance of contributing back your ranking  
>> algorithms?  :-)
>>
>>
>> On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:
>>
>>> There is an expression in French that says "comparer des pommes et
>>> des poires", which literally means "to compare apples and pears".
>>> That's what this paper is about. From my point of view, such a
>>> comparison would be interesting only if a cross analysis of
>>> different criteria (for example, retrieval effectiveness (aka
>>> search quality), search time, indexing time, index size, query
>>> language, index structure, and so on) is done. Comparing different
>>> systems based on only one criterion is not well-grounded.  There is
>>> always a kind of trade-off: for example, besides other parameters
>>> (ranking algorithm, frequency statistics, document structure,
>>> etc.), indexing with Zettair is much faster than indexing with
>>> Lucene, but if we consider search time, Lucene is better than
>>> Zettair. Why? For many reasons, but probably because Zettair
>>> doesn't have the complex document structure of Lucene, besides the
>>> ranking algorithm (Okapi BM25 vs. tf-idf).  Some systems compute
>>> and store the scores at indexing time, which makes them faster at
>>> search time but less flexible if you want to change or implement a
>>> new ranking algorithm.
>>>
>>>> Still, when a well-respected researcher in the field says Lucene
>>>> didn't do so hot in certain areas,
>>>
>>> If we consider the search quality, that's simply not true if we
>>> know how to implement in Lucene popular ranking algorithms such as
>>> Okapi BM25 (at least).  I've been working with Lucene for four
>>> years now; all the experiments of my thesis have been done using
>>> Lucene (with many adaptations to implement the most recent ranking
>>> algorithms, including different language models, divergence from
>>> randomness, etc.).  I also participated in major IR campaigns
>>> (NTCIR, CLEF and TREC) and the results are not bad at all (see
>>> http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
>>> for NTCIR-5 or
>>> http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
>>> for NTCIR-6; for CLEF have a look at
>>> http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
>>> ...); for other information search the web ;-)
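
(For reference, the textbook form of the Okapi BM25 function Samir
mentions, nothing specific to his implementation, is:

    score(D,Q) = \sum_{t \in Q} IDF(t) \cdot
        \frac{f(t,D)\,(k_1 + 1)}{f(t,D) + k_1\,(1 - b + b\,|D|/avgdl)}

    IDF(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}

where f(t,D) is the frequency of term t in document D, |D| is the
document length, avgdl the average document length over the
collection, N the number of documents, n(t) the number of documents
containing t, and k_1 (typically ~1.2) and b (typically ~0.75) are
free parameters.)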
>>>
>>> Samir
>>>
>>>
>>>> -----Original Message-----
>>>> From: Mark Miller [mailto:markrmiller@gmail.com]
>>>> Sent: Friday, December 7, 2007 21:01
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: O/S Search Comparisons
>>>>
>>>> Yes, and even if they did not use the stock defaults, I would bet
>>>> there would be complaints about what was done wrong at every turn.
>>>> This seems like a very difficult thing to do. How long does it
>>>> take to fully learn how to correctly utilize each search engine
>>>> for the task at hand? Longer, I am sure, than these busy men could
>>>> possibly take. It seems that such a comparison could only be done
>>>> legitimately if experts for each search engine set up the
>>>> indexing/searching processes. Even then the results seem like they
>>>> could be difficult to measure... e.g., was each search engine
>>>> configured so that it would only break on spaces for indexing and
>>>> do nothing else special at all? So many small settings, and so
>>>> much knowledge, are needed to ensure each engine is on level
>>>> ground...
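
(In Lucene's case, at least, "only break on spaces" is easy to pin
down; a minimal sketch, with a made-up index path:

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class WhitespaceIndexDemo {
    public static void main(String[] args) throws Exception {
      // WhitespaceAnalyzer splits tokens on whitespace only: no
      // lowercasing, no stopwords, no stemming.
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/path/to/index"),  // made-up path
          new WhitespaceAnalyzer(),
          true);  // true = create a new index
      // ... addDocument() calls go here ...
      writer.close();
    }
  }

Whether the other engines can be pinned to the same tokenization is
exactly the hard part.)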
>>>>
>>>> I doubt it will ever happen, but some sort of open source
>>>> search-off would be pretty cool <g>. Then each camp could properly
>>>> configure their search engine for each task.
>>>>
>>>> - Mark
>>>>
>>>> Mike Klaas wrote:
>>>>> There is a good chance that they were using stock indexing  
>>>>> defaults,
>>>>> based on:
>>>>>
>>>>> Lucene:
>>>>> " In the present work, the simple applications
>>>>> bundled with the library were used to index the collection. "
>>>>>
>>>>> On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
>>>>>
>>>>>> Yeah, I wasn't too excited over it and I certainly didn't lose
>>>>>> any sleep over it, but there are some interesting things of note
>>>>>> in there concerning Lucene, including the claim that it fell
>>>>>> over on indexing the WT10g docs (page 40), and I am always
>>>>>> looking for ways to improve things.  Overall, I think Lucene
>>>>>> held up pretty well in the evaluation, and I know how suspect
>>>>>> _any_ evaluation is given the myriad ways of doing search.
>>>>>> Still, when a well-respected researcher in the field says Lucene
>>>>>> didn't do so hot in certain areas, I don't think we can dismiss
>>>>>> them out of hand.  So regardless of the tests being right or
>>>>>> wrong, it is worth addressing either the failures in Lucene or
>>>>>> the failures in the test, such that we make sure we are properly
>>>>>> educating our users on how best to use Lucene.
>>>>>>
>>>>>> I emailed the authors asking for information on how the test  
>>>>>> was run
>>>>>> etc., so we'll see if anything comes of it.
>>>>>>
>>>>>> On Dec 7, 2007, at 12:04 PM, robert engels wrote:
>>>>>>
>>>>>>> I wouldn't get too excited over this. Once again, it does not
>>>>>>> seem the evaluator understands the nature of GC based systems,
>>>>>>> and the memory statistics are quite out of whack. But it is
>>>>>>> hard to tell because there is no data on how memory consumption
>>>>>>> was actually measured.
>>>>>>>
>>>>>>> A far better way of measuring memory consumption is to cap the
>>>>>>> process at different levels (max RAM sizes), and compare the
>>>>>>> performance at each level.
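
(For the Java-based engines that capping is just a JVM flag; for
example, something like the following, where the benchmark class and
paths are made up and only -Xmx is real:

  java -Xmx64m   -cp lucene-core.jar:. IndexBenchmark /data/wt10g
  java -Xmx256m  -cp lucene-core.jar:. IndexBenchmark /data/wt10g
  java -Xmx1024m -cp lucene-core.jar:. IndexBenchmark /data/wt10g

Comparing indexing/search times across the heap ceilings would give a
much fairer picture than a single reported memory number.)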
>>>>>>>
>>>>>>> There is also the fact that a process takes memory from the
>>>>>>> disk cache, and vice versa, which heavily affects search
>>>>>>> performance, etc.
>>>>>>>
>>>>>>> Since there is no detailed data (that I could find) about
>>>>>>> system configuration, etc., the results are highly suspect.
>>>>>>>
>>>>>>> There is also no mention of performance on multi-processor
>>>>>>> systems. Some systems (like Lucene) pay a penalty to support
>>>>>>> multi-processing (both in Java and Lucene), and only realize
>>>>>>> this benefit when operating in a multi-processor environment.
>>>>>>>
>>>>>>> Based on the sheer speed of XMLSearch and Zettair, they seem
>>>>>>> like likely candidates for a closer look at their design.
>>>>>>>
>>>>>>> On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:
>>>>>>>
>>>>>>>> Was wondering if people have seen
>>>>>>>> http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
>>>>>>>>
>>>>>>>> Has some interesting comparisons.  Obviously, the comparison
>>>>>>>> of Lucene indexing is done w/ 1.9, so it probably needs to be
>>>>>>>> done again.  Just wondering if people see any opportunities to
>>>>>>>> improve Lucene from it.  I am going to try and contact the
>>>>>>>> authors to see if I can get what their setup values were
>>>>>>>> (mergeFactor, Analyzer, etc.), as I think it would be
>>>>>>>> interesting to run the tests again on 2.3.
>>>>>>>>
>>>>>>>> -Grant
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://lucene.grantingersoll.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>

