lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: Token termBuffer issues
Date Wed, 25 Jul 2007 09:59:41 GMT
"Yonik Seeley" <> wrote:
> On 7/24/07, Michael McCandless <> wrote:
> > "Yonik Seeley" <> wrote:
> > > On 7/24/07, Michael McCandless <> wrote:
> > > > OK, I ran some benchmarks here.
> > > >
> > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > > 17.2% speedup using Sun's JDK 6, on Linux.  This is indexing all
> > > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > > PorterStemFilter.  I think it's worth pursuing!
> > >
> > > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > > when creating new tokens from the tokenizer)?
> >
> > I haven't tried this variant yet.  I guess for long filter chains the
> > GC cost of the tokenizer making the initial token should shrink as a
> > fraction of the overall time.  Though I think we should still re-use
> > the initial token since it should (?) only help.
> If it weren't any slower, that would be great... but I worry about
> filters that need buffering (either on the input side or the output
> side) and how that interacts with filters that try and reuse.

OK I will tease out this effect & measure performance impact.

This would mean that the tokenizer must not only produce a new Token
instance for each term but also must not re-use the underlying char[]
buffer in that token, right?  EG with my mods for CharTokenizer I re-use
its char[] buffer with every Token, but I'll change that to allocate a
new buffer for each Token for this test.
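The two variants being compared could be sketched roughly like this (class
and method names are invented for illustration; this is not the actual
CharTokenizer code):

```java
// Hypothetical sketch, NOT real Lucene code: contrasts reusing one
// char[] termBuffer across Tokens with allocating a fresh Token and
// buffer for every term, as in the proposed benchmark.
class SketchToken {
    char[] termBuffer;
    int termLength;

    // Copies into the existing buffer, growing it only when too small,
    // so a reused Token stops allocating once the buffer is big enough.
    void setTermBuffer(char[] src, int off, int len) {
        if (termBuffer == null || termBuffer.length < len) {
            termBuffer = new char[len];
        }
        System.arraycopy(src, off, termBuffer, 0, len);
        termLength = len;
    }

    String term() { return new String(termBuffer, 0, termLength); }
}

class SketchCharTokenizer {
    private final char[] input;
    private int pos;

    SketchCharTokenizer(String s) { this.input = s.toCharArray(); }

    // Reuse path: fills the caller-supplied Token in place; no per-token
    // garbage once the buffer has grown to the longest term seen.
    SketchToken next(SketchToken reusable) {
        while (pos < input.length && input[pos] == ' ') pos++;
        if (pos == input.length) return null;
        int start = pos;
        while (pos < input.length && input[pos] != ' ') pos++;
        reusable.setTermBuffer(input, start, pos - start);
        return reusable;
    }

    // No-reuse variant for the test: a new Token AND a new char[] buffer
    // for every term, so the GC cost shows up in the benchmark.
    SketchToken next() {
        return next(new SketchToken());
    }
}
```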

> Tokens that do output buffering could be slowed down if they must copy
> the token state to the passed token.  I like Doron's idea that a new
> Token could be returned anyway.
> The extra complexity seemingly involved in trying to make both
> scenarios perform well is what prompts me to ask what the performance
> gain will be.

Yes I like Doron's idea too -- it's just a "suggestion" to use the
input Token if it's convenient.

I think the resulting API is fairly simple with this change: if you
(the consumer) want a "full private copy" of the Token (like
QueryParser, Highlighter, CachedTokenFilter, a filter that does input
buffering, etc.) you call the next() API.  If instead you can
handle re-use because you will consume this Token once, right now, and
never look at it again (like DocumentsWriter), then you call the
next(Token) API.
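The contract between the two calls might look roughly like this (class and
method names are invented here, not taken from the actual Lucene source):

```java
// Hypothetical sketch of the two consumer choices described above:
// next() hands back a private copy the caller may keep (QueryParser,
// Highlighter, buffering filters), while next(Token) fills a reusable
// Token that is only valid until the following call (DocumentsWriter).
class Tok {
    String term = "";
    void set(String t) { term = t; }
}

class SketchStream {
    private final String[] terms;
    private int i;

    SketchStream(String... terms) { this.terms = terms; }

    // Reuse path: the passed Token is a "suggestion" -- the stream fills
    // it in place, so a consumer must NOT hold onto the returned Token.
    Tok next(Tok reusable) {
        if (i == terms.length) return null;
        reusable.set(terms[i++]);
        return reusable;
    }

    // Private-copy path: safe for consumers that buffer tokens, at the
    // cost of one allocation per token.
    Tok next() {
        return next(new Tok());
    }
}
```

A buffering consumer that mistakenly cached the Token returned by
next(Token) would see its contents overwritten on the very next call,
which is the interaction with buffering filters worried about above.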

