lucene-solr-user mailing list archives

From Andrea Gazzarini <gxs...@gmail.com>
Subject Issue in the analysis chain
Date Fri, 02 Dec 2016 11:00:13 GMT
Hi,
I found some strange behavior with the MappingCharFilterFactory in Solr 
*6.2.1*. I'm curious whether I'm missing something or whether anyone else 
has run into this.

I have a (index and query) chain composed as follows:

<charFilter class="solr.MappingCharFilterFactory" 
mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory" />
...

The mapping-FoldToASCII.txt is the exact file shipped in the Solr 
download bundle; I didn't add any mappings.
I started having some search issues and, after checking, I saw that some 
characters with diacritics weren't being replaced. I isolated one of those 
cases and tried to see what happens in the analysis page.

As expected, the characters weren't replaced, so I tried char by char. 
That doesn't work either.
An example:

I pasted īà in the "Field Value (Index)" box. The *ī* char is 
Unicode *\u012b*, which is already mapped in mapping-FoldToASCII.txt.
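
One thing worth ruling out first is whether the pasted text really contains the precomposed code point, or a plain "i" followed by a combining mark, since a decomposed sequence would not match the \u012b entry. The following is only a minimal sketch using the Python standard library; the helper name `describe` is made up for illustration and has nothing to do with Solr.

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Precomposed input, as described in the message.
for code, name in describe("\u012b\u00e0"):
    print(code, name)

# A decomposed variant for comparison: "i" followed by COMBINING MACRON.
for code, name in describe("i\u0304"):
    print(code, name)
```

If the first loop reports U+012B and U+00E0, the input is precomposed and the mapping file entry should apply.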

Without the "Verbose Output" flag [1]

  * I see an empty space beside the MCF (MappingCharFilter), where I'd
    expect to see the replaced characters "i" and "a"
  * the KeywordTokenizer reports exactly my input "īà", so it seems the
    MCF didn't make any change to the source input

However, if I turn the "Verbose Output" flag on [2]

  * You can see that the MCF is working (i.e. ī becomes i, and à becomes a)
  * But the KeywordTokenizer still ignores that and produces īà
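
To check whether the difference comes from the analysis page itself rather than from the chain, the same analysis could be requested over HTTP and the raw response inspected. The sketch below only builds the request URL. It assumes the field analysis handler is reachable at `/analysis/field` and accepts the `analysis.fieldtype` and `analysis.fieldvalue` parameters; the host, the core name `mycore` and the field type name `text_folded` are placeholders, not values from my setup.

```python
from urllib.parse import urlencode

def analysis_url(base_url, core, field_type, value):
    """Build a field-analysis request URL (handler path is an assumption)."""
    params = {
        "analysis.fieldtype": field_type,   # hypothetical field type name
        "analysis.fieldvalue": value,       # text to run through the index chain
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"

# Placeholder host, core and field type; substitute your own.
print(analysis_url("http://localhost:8983/solr", "mycore", "text_folded", "\u012b\u00e0"))
```

Opening the printed URL in a browser or with curl should return the per-stage output without going through the admin UI.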

I tried the same with a Solr 4.7.1 instance and, as you can see [3], it 
works as I would expect.
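
To make the expectation concrete, here is a tiny model of the chain: the char filter rewrites the text first, then the keyword tokenizer emits the whole filtered text as a single token. This is not Solr or Lucene code, just an illustration; the function names are invented and the mapping table contains only the two characters from my example.

```python
# Only the two mappings discussed above; the real mapping file has many more.
MAPPING = {"\u012b": "i", "\u00e0": "a"}

def char_filter(text, mapping=MAPPING):
    """Replace each mapped character; leave everything else untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)

def keyword_tokenizer(text):
    """Emit the entire input as one token."""
    return [text]

def analyze(text):
    """Char filter first, then tokenizer, mirroring the configured chain order."""
    return keyword_tokenizer(char_filter(text))

print(analyze("\u012b\u00e0"))  # prints ['ia']
```

That single token "ia" is what I would expect the KeywordTokenizer row to show.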

Any help would be warmly appreciated

Best,
Andrea

[1] https://drive.google.com/file/d/0B82QaJKoMzvWb3dLcW80ME5wdXc/view
[2] https://drive.google.com/file/d/0B82QaJKoMzvWN2lNSF9JQUhPZ3c/view
[3] https://drive.google.com/file/d/0B82QaJKoMzvWeHRzUnU3MGFtY2s/view
