lucene-solr-user mailing list archives

From Stephan Damson <stephan.damson....@bayer.com>
Subject SOLR Tokenizer “solr.SimplePatternSplitTokenizerFactory” splits at unexpected characters
Date Tue, 26 Feb 2019 07:18:59 GMT
Hi!

I'm having unexpected results with the solr.SimplePatternSplitTokenizerFactory. The pattern
used is actually from an example in the SOLR documentation and I do not understand where I
made a mistake or why it does not work as expected.
If we take the example input "operative", the analyzer shows that during indexing the input
is split into the tokens "ope", "a" and "ive". That is, the tokenizer splits at the characters
"r" and "t" rather than at the expected whitespace characters (CR, TAB). Just to be sure, I also
tried using more than one backslash in the pattern (e.g. \t and \\t),
but this did not change how the input is tokenized during indexing.

What am I missing?
SOLR version used is 7.5.0.
The definition of the field type in the schema is as follows:
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>

    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>

    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
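
One variant that may be worth testing is sketched below. It rests on an assumption I have not confirmed: that the tokenizer's pattern is parsed with Lucene's own regular-expression syntax rather than java.util.regex, so that a backslash simply makes the following character literal and "\t" is read as the letter "t". If that is the case, writing the whitespace characters as XML character references should pass the real TAB, LF and CR characters to the tokenizer. This is a sketch under that assumption, not a confirmed fix.

```xml
<!-- Sketch only: assumes backslash escapes such as \t are taken as literal letters
     by the pattern syntax. &#x9; = TAB, &#xA; = LF, &#xD; = CR. -->
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;]+"/>
```

The same change would have to be made in both the index and the query analyzer so that the two stay consistent.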

Many thanks in advance for any help you can provide!
