lucene-solr-user mailing list archives

From Modassar Ather <modather1...@gmail.com>
Subject Regarding HTMLStripCharFilter.
Date Tue, 02 Aug 2016 04:00:55 GMT
Hi,

Kindly help me understand the way HTMLStripCharFilter works.

I have the following analysis chain:

int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
        | WordDelimiterFilter.GENERATE_NUMBER_PARTS
        | WordDelimiterFilter.CATENATE_WORDS
        | WordDelimiterFilter.CATENATE_NUMBERS
        | WordDelimiterFilter.CATENATE_ALL
        | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
        | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE
        | WordDelimiterFilter.PRESERVE_ORIGINAL;

    @Override
    protected Reader initReader(String field, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String arg0) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream wordDelimiterStream = new WordDelimiterFilter(source, flags, null);
        TokenStream rdtStream = new RemoveDuplicatesTokenFilter(wordDelimiterStream);

        return new TokenStreamComponents(source, rdtStream);
    }

With the above analysis chain, the input *teRm<sub>3</sub>* produces the
following tokens:

*Text     Position Increment    Position Length    Offset attribute*
teRm3    1                     1                  0, 16
Rm3      1                     1                  0, 16
te       0                     1                  0, 16
teRm3    0                     1                  0, 16

In the above table, teRm3 occurs twice, but the second occurrence is not
removed by RemoveDuplicatesTokenFilter.

In contrast, the plain input *teRm3* is tokenized by the same analysis chain as follows:

*Text      Position Increment    Position Length    Offset attribute*
teRm3   1                               1                           0, 5
te          0                               1                           0, 2
Rm3      1                               1                           2, 5

In this table, the duplicate *teRm3* was removed by
RemoveDuplicatesTokenFilter, so it appears only once.
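
To make the comparison concrete, here is a minimal sketch of how I understand the duplicate check, assuming RemoveDuplicatesTokenFilter only drops a token when the same text was already emitted at the same position. This is not Lucene code: `DedupSketch` and its `dedup` method are hypothetical names, and the second input in `main` is an assumed pre-filter stream for illustration, not something I captured from the analyzer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    // Keeps a token unless the same text was already seen at the same position.
    // texts[i] and posIncs[i] describe token i (term text, position increment).
    // Each kept token is returned as "text@absolutePosition".
    public static List<String> dedup(String[] texts, int[] posIncs) {
        List<String> kept = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        int position = 0;
        for (int i = 0; i < texts.length; i++) {
            if (posIncs[i] > 0) {
                // A positive increment starts a new position, so earlier terms are forgotten.
                position += posIncs[i];
                seenAtPosition.clear();
            }
            // Set.add returns false when the text was already seen at this position.
            if (seenAtPosition.add(texts[i])) {
                kept.add(texts[i] + "@" + position);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Rows of the first table (input with the <sub> markup).
        System.out.println(dedup(
                new String[] {"teRm3", "Rm3", "te", "teRm3"},
                new int[] {1, 1, 0, 0}));

        // Assumed (hypothetical) stream in which the repeated term shares a position.
        System.out.println(dedup(
                new String[] {"teRm3", "te", "teRm3", "Rm3"},
                new int[] {1, 0, 0, 1}));
    }
}
```

If that assumption holds, the two teRm3 tokens of the first table sit at positions 1 and 2, so they would not count as duplicates of each other, while a repeated teRm3 at a single position would be dropped.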

Please share your comments on this difference in analysis behavior.

Thanks,
Modassar
