lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
Date Sun, 25 Sep 2011 17:50:26 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114300#comment-13114300 ]

Uwe Schindler commented on LUCENE-2279:
---------------------------------------

You misunderstood the response: StopFilter itself indeed did not change. What changed is that
in Lucene 4.0 all Analyzers are required to reuse TokenStream instances, so the StopFilter is
created only once in your application (when the Analyzer is created).
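
To illustrate the difference, here is a minimal sketch in plain Java rather than against the
Lucene API. FastSet, SketchStopFilter and StopSetCopyDemo are hypothetical names invented for
this example; FastSet only stands in for CharArraySet and mirrors the instanceof check from
the constructor quoted in the issue description below. The sketch counts how often the stop
set is copied when a filter is built per document versus when one filter is reused.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for CharArraySet; this is NOT a Lucene class.
final class FastSet extends HashSet<String> {
    static int copies = 0; // how many times a plain Set had to be copied

    FastSet(int initialCapacity) {
        super(initialCapacity);
    }

    // Mirrors the instanceof check in the quoted StopFilter constructor:
    // reuse the set if it already has the fast type, otherwise copy it.
    static FastSet wrap(Set<String> words) {
        if (words instanceof FastSet) {
            return (FastSet) words;
        }
        FastSet copy = new FastSet(words.size());
        copy.addAll(words);
        copies++;
        return copy;
    }
}

// Hypothetical filter: holds the stop set and rejects matching tokens.
final class SketchStopFilter {
    private final FastSet stopWords;

    SketchStopFilter(Set<String> stopWords) {
        this.stopWords = FastSet.wrap(stopWords);
    }

    boolean accept(String token) {
        return !stopWords.contains(token);
    }
}

final class StopSetCopyDemo {
    // Returns the number of set copies made while "analyzing" the documents.
    // reuse == false builds a new filter per document (the reported problem);
    // reuse == true builds the filter once and keeps it.
    static int copiesFor(Set<String> stopWords, int documents, boolean reuse) {
        FastSet.copies = 0;
        SketchStopFilter shared = reuse ? new SketchStopFilter(stopWords) : null;
        for (int i = 0; i < documents; i++) {
            SketchStopFilter filter = reuse ? shared : new SketchStopFilter(stopWords);
            filter.accept("the");
        }
        return FastSet.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "the", "of"));
        System.out.println(copiesFor(plain, 1000, false));               // prints 1000
        System.out.println(copiesFor(plain, 1000, true));                // prints 1
        System.out.println(copiesFor(FastSet.wrap(plain), 1000, false)); // prints 0
    }
}
```

With a plain Set, the per-document variant copies the set once for every document. Reusing
the filter, or passing a set that was converted once up front, avoids the repeated copy.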

> eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: thushara wijeratna
>            Priority: Minor
>
> Passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter.
> This is because Analyzer.tokenStream() is called for each document, which ends up constructing
> the StopFilter (if used). If a regular Set<String> is passed to the StopFilter, all the elements
> of the set are copied to a new CharArraySet, as we can see in its constructor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input,
>                   Set stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> I feel we should make the StopFilter signature specific, i.e. accept a CharArraySet rather than
> a Set, and there should be a Javadoc warning on the other StopFilter variants, since they all
> result in a copy for each invocation of Analyzer.tokenStream().

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

