lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: improve how IndexWriter uses RAM to buffer added documents
Date Thu, 05 Apr 2007 10:58:59 GMT
"Marvin Humphrey" <> wrote:
> On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
> >> (: Ironically, the numbers for Lucene on that page are a little
> >> better than they should be because of a sneaky bug.  I would have
> >> made updating the results a priority if they'd gone the other  
> >> way.  :)
> >
> > Hrm.  It would be nice to have a hard comparison of Lucene, KS (and
> > Ferret and others?).
> Doing honest, rigorous benchmarking is exacting and labor-intensive.   
> Publishing results tends to ignite flame wars I don't have time for.
> The main point that I wanted to make with that page was that KS was a  
> lot faster than Plucene, and that it was in Lucene's ballpark.   
> Having made that point, I've moved on.  The benchmarking code is  
> still very useful for internal development and I use it frequently.

Agreed.  Though, if the benchmarking is done in a way that anyone
could download & re-run it (eg as part of Lucene's new & developing
benchmark framework), it should help to keep flaming in check.

Accurate & well-communicated benchmark results, both within each
variant/port of Lucene and across them, are crucial for all of us making
iterative progress on performance.

> At some point I would like to port the benchmarking work that has  
> been contributed to Lucene of late, but I'm waiting for that code  
> base to settle down first.  After that happens, I'll probably make a  
> pass and publish some results.  Better to spend the time preparing  
> one definitive presentation than to have to rebut every idiot's  
> latest wildly inaccurate shootout.


> >> ... However, Lucene has been tuned by an army of developers over the
> >> years, while KS is young yet and still had many opportunities for
> >> optimization.  Current svn trunk for KS is about twice as fast for
> >> indexing as when I did those benchmarking tests.
> >
> > Wow, that's an awesome speedup!
> The big bottleneck for KS has been its Tokenizer class.  There's only  
> one such class in KS, and it's regex-based.  A few weeks ago, I  
> finally figured out how to hook it into Perl's regex engine at the C  
> level.  The regex engine is not an official part of Perl's C API, so  
> I wouldn't do this if I didn't have to, but the tokenizing loop is  
> only about 100 lines of code and the speedup is dramatic.

Tokenization is a very big part of Lucene's indexing time as well.

StandardAnalyzer is very time consuming.  When I switched to testing
with WhitespaceAnalyzer, it was quite a bit faster (I don't have exact
numbers).  Then when I created and switched to SimpleSpaceAnalyzer
(which just splits on the space character and, instead of doing
new String(...) for every token, makes offset+length slices into a
char[] array), it was even faster.
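
As a rough illustration of the slicing idea described above, here is a
minimal sketch.  It is not the actual SimpleSpaceAnalyzer code; the
class, field, and method names (SpaceSliceTokenizer, Slice, reset,
next) are made up for this example, and it ignores Lucene's real
Tokenizer/Token API entirely.  It only shows splitting on the space
character while handing back (offset, length) windows into the
caller's char[] rather than allocating a String per token.

```java
// Hypothetical sketch: names are illustrative, not Lucene's API.
final class SpaceSliceTokenizer {

    /** A reusable view of one token: a window into the source buffer. */
    static final class Slice {
        char[] buffer;  // the caller's buffer, never copied
        int offset;     // index of the token's first char
        int length;     // number of chars in the token
    }

    private final Slice slice = new Slice();  // one instance, reused for every token
    private char[] text;
    private int end;
    private int pos;

    /** Point the tokenizer at a new buffer; no allocation happens here. */
    void reset(char[] text, int length) {
        this.text = text;
        this.end = length;
        this.pos = 0;
    }

    /**
     * Returns the next token, or null when the buffer is exhausted.
     * The same Slice instance is returned on every call, so callers
     * must consume it before calling next() again.
     */
    Slice next() {
        // Skip any run of spaces before the token.
        while (pos < end && text[pos] == ' ') {
            pos++;
        }
        if (pos >= end) {
            return null;
        }
        int start = pos;
        // Advance to the next space or the end of the buffer.
        while (pos < end && text[pos] != ' ') {
            pos++;
        }
        slice.buffer = text;
        slice.offset = start;
        slice.length = pos - start;
        return slice;
    }
}
```

The trade-off in a design like this is that the returned slice is only
valid until the next call, so any consumer that needs to keep a token
has to copy it out itself.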

This is why the "your mileage will vary" caveat is extremely important.
For most users of Lucene, I'd expect that 1) retrieving the doc from
whatever its source is, and 2) tokenizing, take a substantial amount
of time.  So the gains I'm seeing in my benchmarks won't usually be
seen by normal applications unless these applications have already
optimized their doc retrieval/tokenization.

And now that indexing each document is so fast, segment merging has
become a BIG part (66% in my "large index" test in LUCENE-856) of
indexing.  Marvin, do you have any sense of what the equivalent cost is
in KS (I think for KS you "add" a previous segment not that
differently from how you "add" a document)?

> I've also squeezed out another 30-40% by changing the implementation  
> in ways which have gradually winnowed down the number of malloc()  
> calls.  Some of the techniques may be applicable to Lucene; I'll get  
> around to firing up JIRA issues describing them someday.

This generally was my approach in LUCENE-843 (minimize "new
Object()").  I re-use Posting objects, the hash for Posting objects,
byte buffers, etc.  I share large int[] blocks and char[] blocks
across Postings and re-use them.  Etc.
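
For readers unfamiliar with the technique, the general pattern of
recycling objects instead of calling "new" for each one can be
sketched as below.  This is not the LUCENE-843 patch: PostingPool, the
fields on Posting, and the allocations counter are all hypothetical
and exist only to make the example self-contained.

```java
// Hypothetical sketch of object re-use; not code from LUCENE-843.
final class PostingPool {

    /** Stand-in for a per-term posting; the fields are illustrative only. */
    static final class Posting {
        int termId;
        int freq;

        void clear() {
            termId = 0;
            freq = 0;
        }
    }

    private Posting[] free = new Posting[8];  // stack of recycled instances
    private int freeCount = 0;
    int allocations = 0;  // counts real "new Posting()" calls, for illustration

    /** Hand out a recycled Posting if one is available, else allocate. */
    Posting acquire() {
        if (freeCount > 0) {
            Posting p = free[--freeCount];
            free[freeCount] = null;
            return p;
        }
        allocations++;
        return new Posting();
    }

    /** Return a Posting to the pool so a later acquire() can re-use it. */
    void release(Posting p) {
        p.clear();
        if (freeCount == free.length) {
            // Grow the free stack; this happens rarely once the pool warms up.
            free = java.util.Arrays.copyOf(free, free.length * 2);
        }
        free[freeCount++] = p;
    }
}
```

Once the pool has warmed up, a loop that acquires and releases
Postings per document should stop allocating, which is the effect the
paragraph above is after.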

The one thing that still baffles me is: I can't get a persistent
Posting hash to be any faster.  I still reset the Posting hash with
every document, but I had variants in my iterations that kept the
Postings hash between documents (just flushing the int[]'s
periodically).  I had expected that leaving Posting instances in the
hash, esp. for frequent terms, would be a win, but so far I haven't
seen that empirically.

> > So KS is faster than Lucene today?
> I haven't tested recent versions of Lucene.  I believe that the  
> current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
> But... A) I don't have an official release out with the current  
> Tokenizer code, B) I have no immediate plans to prepare further  
> published benchmarks, and C) it's not really important, because so  
> long as the numbers are close you'd be nuts to choose one engine or  
> the other based on that criteria rather than, say, what language your  
> development team speaks.  KinoSearch scales to multiple machines, too.

On C), I think it is important so that the many ports of Lucene can
"compare notes" and "cross fertilize".  I agree that, for users of the
ports of Lucene, this generally shouldn't be the primary factor in
deciding which one to use.

> Looking to the future, I wouldn't be surprised if Lucene edged ahead  
> and stayed slightly ahead speed-wise, because I'm prepared to make  
> some sacrifices for the sake of keeping KinoSearch's core API simple  
> and the code base as small as possible.  I'd rather maintain a  
> single, elegant, useful, flexible, plenty fast regex-based Tokenizer  
> than the slew of Tokenizers Lucene offers, for instance.  It might be  
> at a slight disadvantage going mano a mano against Lucene's  
> WhiteSpaceTokenizer, but that's fine.

That's a fair tradeoff.  Performance certainly isn't everything.  But
does KS give its users a choice in Tokenizer?  Or, can users
pre-tokenize their fields themselves?

