lucene-dev mailing list archives

From Marvin Humphrey <>
Subject Re: improve how IndexWriter uses RAM to buffer added documents
Date Thu, 05 Apr 2007 18:08:39 GMT

On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:

> Marvin do you have any sense of what the equivalent cost is
> in KS

It's big.  I don't have any good optimizations to suggest in this area.

> (I think for KS you "add" a previous segment not that
> differently from how you "add" a document)?

Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and  
the seg-at-a-time indexing strategy, segments don't get merged nearly  
as often as they do in Lucene.
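
The cascade behaves roughly like carrying in a Fibonacci-style number
system: a freshly flushed segment that has caught up in size with the
segment before it triggers a merge, and that merge can itself trigger
the next one.  A hypothetical Java sketch of that kind of schedule
(class name and the exact threshold rule are illustrative, not KS's
actual code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative merge-schedule sketch, NOT KinoSearch's real code.
// Segment sizes live on a stack, oldest/largest at the bottom.  When
// the newest segment is at least as large as the one beneath it, the
// two merge -- and the merge may cascade upward, the way adding 1 to a
// Fibonacci-base number carries.
public class FibMergeSketch {
    private final Deque<Long> segSizes = new ArrayDeque<>(); // head = newest

    public void flush(long newSegSize) {
        segSizes.push(newSegSize);
        while (segSizes.size() >= 2) {
            long top = segSizes.pop();
            long next = segSizes.peek();
            if (top >= next) {
                segSizes.pop();          // remove the segment beneath
                segSizes.push(top + next); // replace both with the merge
            } else {
                segSizes.push(top);      // no merge; put it back
                break;
            }
        }
    }

    public int segmentCount() {
        return segSizes.size();
    }
}
```

Because small segments are absorbed quickly and big segments only merge
when a neighbor catches up to them, merges of large segments stay rare.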

> I share large int[] blocks and char[] blocks
> across Postings and re-use them.  Etc.

Interesting.  I will have to try something like that!
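
For anyone following along, the idea is to draw fixed-size buffers from
a shared pool instead of allocating per posting, then recycle the whole
pool after a flush.  A minimal Java sketch under that assumption (names
and block size are mine, not Lucene's actual implementation):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical block pool, NOT Lucene's real code.  Postings acquire
// fixed-size int[] blocks from the pool; after a flush the blocks are
// released and the next indexing round reuses them, so steady-state
// indexing stops allocating.
public class IntBlockPool {
    public static final int BLOCK_SIZE = 1024;

    private final Deque<int[]> free = new ArrayDeque<>();
    private int allocated = 0;

    public int[] acquire() {
        int[] block = free.poll();
        if (block == null) {
            block = new int[BLOCK_SIZE]; // genuinely new allocation
            allocated++;
        }
        return block;
    }

    // Hand a block back (e.g. after flushing a segment) for reuse.
    public void release(int[] block) {
        free.push(block);
    }

    public int allocations() {
        return allocated;
    }
}
```

The same pattern works for char[] blocks; the win is that allocation
count is bounded by peak concurrent usage, not by document count.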

> On C) I think it is important so the many ports of Lucene can "compare
> notes" and "cross fertilize".

Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
the patch. ;)

Cross-fertilization is a powerful tool for stimulating algorithmic  
innovation.  Exhibit A: our unfolding collaborative successes.

That's why it was built into the Lucy proposal:

     [Lucy's C engine] will provide core, performance-critical
     functionality, but leave as much up to the higher-level
     language as possible.

Users from diverse communities approach problems from different  
angles and come up with different solutions.  The best ones will  
propagate across Lucy bindings.

The only problem is that, since Dave Balmain has been much less
available than we expected, it's been largely up to me to get Lucy to
the critical mass where other people can start writing bindings.

> Performance certainly isn't everything.

That's a given in scripting language culture.  Most users are  
concerned with minimizing developer time above all else.  Ergo, my  
emphasis on API design and simplicity.

> But does KS give its users a choice in Tokenizer?

You supply a regular expression which matches one token.

   # Presto! A WhiteSpaceTokenizer:
   my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
       token_re => qr/\S+/,
   );
> Or, can users pre-tokenize their fields themselves?

TokenBatch provides an API for bulk addition of tokens; you can  
subclass Analyzer to exploit that.

Marvin Humphrey
Rectangular Research
