lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: StandardTokenizer#setMaxTokenLength
Date Mon, 20 Jul 2015 18:00:20 GMT
Hi Piotr,

The behavior you mention is an intentional change from the behavior in Lucene 4.9.0 and earlier,
when tokens longer than maxTokenLength were silently ignored: see LUCENE-5897 [1] and LUCENE-5400 [2].

The new behavior is as follows: Token matching rules are no longer allowed to match against
input char sequences longer than maxTokenLength.  If a rule would match a sequence longer
than maxTokenLength, but also matches at maxTokenLength chars or fewer, has the highest
priority among all rules matching at that length, and no other rule matches more chars,
then a token will be emitted for that rule at the matched length, and the rule-matching
iteration simply continues from that point as normal.  If the same rule matches against the
remainder of the sequence that the first rule would have matched if maxTokenLength were longer,
then another token at the matched length will be emitted, and so on.  Note that this can result
in effectively splitting the sequence at maxTokenLength intervals as you noted.
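
To make that concrete, here is a minimal sketch of the splitting behavior (assuming the
Lucene 5.x tokenizer API; the 25-char input string and class name are made up for
illustration).  With maxTokenLength set to 10, the single 25-char alphanumeric run should
come out as three tokens:

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class MaxTokenLengthDemo {
      public static void main(String[] args) throws Exception {
        StandardTokenizer tok = new StandardTokenizer();
        tok.setMaxTokenLength(10);  // rules may match at most 10 chars
        tok.setReader(new StringReader("abcdefghijabcdefghijabcde"));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
          System.out.println(term.toString());
        }
        tok.end();
        tok.close();
      }
    }

This should print "abcdefghij", "abcdefghij", "abcde": the same rule re-matches against
the remainder at each step, splitting the run at maxTokenLength intervals.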

I doubt ClassicAnalyzer has the same issue, since it isn’t built with the scanner buffer
limitation technique used when constructing StandardTokenizer and UAX29URLEmailTokenizer.
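
If you want to verify empirically, the same harness works with ClassicTokenizer, which
also exposes setMaxTokenLength.  Assuming it retains the pre-4.10 behavior of silently
skipping over-length tokens, the same made-up 25-char run should produce no output at
all rather than being split:

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ClassicMaxTokenLengthCheck {
      public static void main(String[] args) throws Exception {
        ClassicTokenizer tok = new ClassicTokenizer();
        tok.setMaxTokenLength(10);
        tok.setReader(new StringReader("abcdefghijabcdefghijabcde"));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
          System.out.println(term.toString());  // expected: nothing, token is dropped
        }
        tok.end();
        tok.close();
      }
    }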

Steve

[1] https://issues.apache.org/jira/browse/LUCENE-5897
[2] https://issues.apache.org/jira/browse/LUCENE-5400

> On Jul 20, 2015, at 4:21 AM, Piotr Idzikowski <piotridzikowski@gmail.com> wrote:
> 
> Hello.
> Btw, I think ClassicAnalyzer has the same problem.
> 
> Regards
> 
> On Fri, Jul 17, 2015 at 4:40 PM, Steve Rowe <sarowe@gmail.com> wrote:
> 
>> Hi Piotr,
>> 
>> Thanks for reporting!
>> 
>> See https://issues.apache.org/jira/browse/LUCENE-6682
>> 
>> Steve
>> www.lucidworks.com
>> 
>>> On Jul 16, 2015, at 4:47 AM, Piotr Idzikowski <piotridzikowski@gmail.com> wrote:
>>> 
>>> Hello.
>>> I am developing my own analyzer based on StandardAnalyzer.
>>> I realized that tokenizer.setMaxTokenLength is called many times.
>>> 
>>> protected TokenStreamComponents createComponents(final String fieldName,
>>>     final Reader reader) {
>>>   final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
>>>   src.setMaxTokenLength(maxTokenLength);
>>>   TokenStream tok = new StandardFilter(getVersion(), src);
>>>   tok = new LowerCaseFilter(getVersion(), tok);
>>>   tok = new StopFilter(getVersion(), tok, stopwords);
>>>   return new TokenStreamComponents(src, tok) {
>>>     @Override
>>>     protected void setReader(final Reader reader) throws IOException {
>>>       src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
>>>       super.setReader(reader);
>>>     }
>>>   };
>>> }
>>> 
>>> Does it make sense if the length stays the same? I see it finally calls this
>>> one (in StandardTokenizerImpl):
>>> public final void setBufferSize(int numChars) {
>>>   ZZ_BUFFERSIZE = numChars;
>>>   char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
>>>   System.arraycopy(zzBuffer, 0, newZzBuffer, 0, Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
>>>   zzBuffer = newZzBuffer;
>>> }
>>> So it just copies the old array's content into the new one.
>>> 
>>> Regards
>>> Piotr Idzikowski