lucene-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Token termBuffer issues
Date Tue, 24 Jul 2007 22:09:47 GMT
On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote:
> "Yonik Seeley" <yonik@apache.org> wrote:
> > On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote:
> > > OK, I ran some benchmarks here.
> > >
> > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > 17.2% speedup using Sun's JDK 6, on Linux.  This is indexing all
> > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > PorterStemFilter.  I think it's worth pursuing!
> >
> > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > when creating new tokens from the tokenizer)?
>
> I haven't tried this variant yet.  I guess for long filter chains
> the GC cost of the tokenizer creating the initial token should
> shrink as a fraction of the overall time.  Though I think we should
> still re-use the initial token, since it should (?) only help.

If it weren't any slower, that would be great... but I worry about
filters that need buffering (either on the input side or the output
side) and how that interacts with filters that try to reuse.
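
To make the reuse contract concrete, here's roughly the shape being
discussed, as a sketch only -- the class and method names below are
made up for illustration, not the actual API:

// Stand-ins for the Token / TokenStream shapes under discussion; the
// point is the contract: the caller passes a Token in, the producer
// fills its termBuffer and hands it back, so nothing new is allocated
// per token.

class SketchToken {
  private char[] termBuffer = new char[32];
  private int termLength;

  char[] termBuffer() { return termBuffer; }
  int termLength() { return termLength; }

  void setTermBuffer(char[] buffer, int offset, int length) {
    if (termBuffer.length < length) termBuffer = new char[length];
    System.arraycopy(buffer, offset, termBuffer, 0, length);
    termLength = length;
  }
}

abstract class SketchTokenStream {
  // Hypothetical reuse variant of next(): fill 'result' if convenient
  // and return it, or return some other Token if reuse doesn't fit.
  abstract SketchToken next(SketchToken result) throws java.io.IOException;
}

// A filter that only mutates the term in place (e.g. lowercasing)
// reuses trivially: it rewrites the buffer of whatever token it got.
class SketchLowerCaseFilter extends SketchTokenStream {
  private final SketchTokenStream input;

  SketchLowerCaseFilter(SketchTokenStream input) { this.input = input; }

  SketchToken next(SketchToken result) throws java.io.IOException {
    SketchToken t = input.next(result);
    if (t == null) return null;
    char[] buf = t.termBuffer();
    for (int i = 0; i < t.termLength(); i++) {
      buf[i] = Character.toLowerCase(buf[i]);
    }
    return t;  // no new Token, no copy
  }
}

A chain of filters like that never allocates per token; the trouble
starts when a filter has to hold tokens across calls.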

Filters that do output buffering could be slowed down if they must
copy the token state into the passed-in token.  I like Doron's idea
that such a filter could just return a new Token anyway.
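
In sketch form, Doron's suggestion looks something like this (again
using the made-up classes above, not real API):

// A filter that buffers output just ignores the passed-in token and
// returns whatever it has queued up, instead of paying to copy state
// into 'result'.  Callers then must not assume the Token they get
// back is the one they passed in.

class SketchBufferingFilter extends SketchTokenStream {
  private final SketchTokenStream input;
  private final java.util.LinkedList<SketchToken> pending =
      new java.util.LinkedList<SketchToken>();

  SketchBufferingFilter(SketchTokenStream input) { this.input = input; }

  SketchToken next(SketchToken result) throws java.io.IOException {
    // 'result' is deliberately unused on the output side.
    if (pending.isEmpty()) fillPending();
    return pending.isEmpty() ? null : pending.removeFirst();
  }

  // Input-side buffering is where reuse bites: we can't hold onto a
  // token the producer may recycle, so we keep our own private copies.
  private void fillPending() throws java.io.IOException {
    for (int i = 0; i < 2; i++) {
      SketchToken mine = new SketchToken();
      SketchToken t = input.next(mine);
      if (t == null) break;
      if (t != mine) {
        mine.setTermBuffer(t.termBuffer(), 0, t.termLength());
        t = mine;
      }
      pending.add(t);
    }
  }
}

That avoids the copy on the output side; on the input side, a
buffering filter still has to make its own copies of tokens the
producer may recycle.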

The extra complexity that seems to be needed to make both scenarios
perform well is what prompts me to ask how big the performance gain
will actually be.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

