lucene-solr-user mailing list archives

From Ere Maijala <ere.maij...@helsinki.fi>
Subject Re: Ignore accent in a request
Date Mon, 11 Feb 2019 08:57:49 GMT
Please note that mapping characters works well for a small set of
characters, but if you want full Unicode normalization, take a look at
the ICUFoldingFilter:
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ICUFoldingFilter
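
To illustrate the difference between a hand-written mapping and general
Unicode normalization, here is a minimal Python sketch. It is not what
ICUFoldingFilter does internally; it only decomposes characters with the
standard library and drops the combining marks. The function name
fold_accents is made up for this example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Escapes are used so the example does not depend on file encoding.
print(fold_accents("avari\u00e9"))            # avarie
print(fold_accents("\u0100 \u0102 \u0104"))   # A A A
print(fold_accents("\u00c6") == "\u00c6")     # True: AE ligature has no decomposition
```

The last line shows why a dedicated folding filter is still preferable:
characters such as Æ have no Unicode decomposition, so a plain
normalize-and-strip approach leaves them untouched.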

--Ere

elisabeth benoit wrote on 8.2.2019 at 22.47:
> Yes, you do.
> 
> And use the char filter at both index and query time.
> 
> On Fri, Feb 8, 2019 at 19:20, SAUNIER Maxence <MSAUNIER@q1c1.fr> wrote:
> 
>> For the charFilter, do I need to reindex all documents?
>>
>> -----Original Message-----
>> From: Erick Erickson <erickerickson@gmail.com>
>> Sent: Friday, February 8, 2019 18:03
>> To: solr-user <solr-user@lucene.apache.org>
>> Subject: Re: Ignore accent in a request
>>
>> Elisabeth's suggestion is spot on for the accent.
>>
>> One other thing I noticed. You are using KeywordTokenizerFactory combined
>> with EdgeNGramFilterFactory. This implies that you can't search for
>> individual _words_, only prefix queries, i.e.
>> je
>> je s
>> je su
>> je sui
>> je suis
>>
>> You can't search for "suis" for instance.
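
The effect described above can be sketched in a few lines of Python. This
is only an approximation of KeywordTokenizerFactory followed by
EdgeNGramFilterFactory, assuming the minGramSize=3 and maxGramSize=50
values from the schema quoted further down; edge_ngrams is a hypothetical
helper, not a Solr API.

```python
from typing import List


def edge_ngrams(value: str, min_gram: int = 3, max_gram: int = 50) -> List[str]:
    """Return the prefixes of the whole value, treated as a single token."""
    longest = min(max_gram, len(value))
    return [value[:size] for size in range(min_gram, longest + 1)]


grams = edge_ngrams("je suis avarie")
print(grams[:3])         # ['je ', 'je s', 'je su']
print("suis" in grams)   # False
```

Every indexed term is a prefix of the entire field value, so a term that
starts in the middle of the value, such as "suis", is never produced.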
>>
>> Basically, this is an efficient way to search for anything starting with
>> a prefix of three or more letters, at the expense of index size. You might
>> be better off just using wildcards (though restrict them to prefixes of at
>> least three letters).
>>
>> This is perfectly valid, I'm mostly asking if it's your intent.
>>
>> Best,
>> Erick
>>
>> On Fri, Feb 8, 2019 at 9:35 AM SAUNIER Maxence <MSAUNIER@q1c1.fr> wrote:
>>>
>>> Thank you!
>>>
>>> -----Original Message-----
>>> From: elisabeth benoit <elisaelisaelisa@gmail.com>
>>> Sent: Friday, February 8, 2019 14:12
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Ignore accent in a request
>>>
>>> Hello,
>>>
>>> We use Solr 7 with
>>>
>>> <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>
>>> with mapping-ISOLatin1Accent.txt
>>>
>>> containing lines like
>>>
>>> # À => A
>>> "\u00C0" => "A"
>>>
>>> # Á => A
>>> "\u00C1" => "A"
>>>
>>> # Â => A
>>> "\u00C2" => "A"
>>>
>>> # Ã => A
>>> "\u00C3" => "A"
>>>
>>> # Ä => A
>>> "\u00C4" => "A"
>>>
>>> # Å => A
>>> "\u00C5" => "A"
>>>
>>> # Ā Ă Ą => A
>>> "\u0100" => "A"
>>> "\u0102" => "A"
>>> "\u0104" => "A"
>>>
>>> # Æ => AE
>>> "\u00C6" => "AE"
>>>
>>> # Ç => C
>>> "\u00C7" => "C"
>>>
>>> # é => e
>>> "\u00E9" => "e"
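
As a rough illustration of what such a mapping file does, the following
Python sketch parses a few of the lines above and applies them to a
string. It is not how MappingCharFilterFactory is implemented; the names
load_mapping and apply_mapping are invented for this example, and it only
handles single-character sources like the ones shown.

```python
import re

# A few entries in the same format as mapping-ISOLatin1Accent.txt.
MAPPING_TEXT = r'''
# capital A with grave => A
"\u00C0" => "A"
# capital AE ligature => AE
"\u00C6" => "AE"
# capital C with cedilla => C
"\u00C7" => "C"
# small e with acute => e
"\u00E9" => "e"
'''

RULE = re.compile(r'^"([^"]+)"\s*=>\s*"([^"]*)"$')


def load_mapping(text):
    """Parse '"source" => "target"' lines, skipping comments and blanks."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match:
            # Turn the literal \uXXXX escape into the actual character.
            source = match.group(1).encode("ascii").decode("unicode_escape")
            mapping[source] = match.group(2)
    return mapping


def apply_mapping(text, mapping):
    """Replace every mapped character; works for single-character sources."""
    return text.translate({ord(src): dst for src, dst in mapping.items()})


mapping = load_mapping(MAPPING_TEXT)
print(apply_mapping("avari\u00e9", mapping))     # avarie
print(apply_mapping("\u00c6\u00c7", mapping))    # AEC
```

Because the same replacement runs on both the indexed text and the query
text, "avarié" and "avarie" end up as the same term on both sides.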
>>>
>>> Best regards,
>>> Elisabeth
>>>
>>> On Fri, Feb 8, 2019 at 11:18, Gopesh Sharma <Gopesh_Sharma@gensler.com> wrote:
>>>
>>>> We fixed this type of issue with synonyms, by adding
>>>> SynonymFilterFactory (before Solr 7).
>>>>
>>>> -----Original Message-----
>>>> From: SAUNIER Maxence <MSAUNIER@q1c1.fr>
>>>> Sent: Friday, February 8, 2019 3:36 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: RE: Ignore accent in a request
>>>>
>>>> Hello,
>>>>
>>>> Thanks for your answer.
>>>>
>>>> I tested:
>>>>
>>>> select?defType=dismax&q=je suis avarié&qf=content
>>>> 90,000 results
>>>>
>>>> select?defType=dismax&q=je suis avarie&qf=content
>>>> 60,000 results
>>>>
>>>> With avarié, I don't find documents containing avarie, and with avarie,
>>>> I don't find documents containing avarié.
>>>>
>>>> I want to find all 150,000 documents containing either avarié or avarie.
>>>>
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: Erick Erickson <erickerickson@gmail.com>
>>>> Sent: Thursday, February 7, 2019 19:37
>>>> To: solr-user <solr-user@lucene.apache.org>
>>>> Subject: Re: Ignore accent in a request
>>>>
>>>> Exactly _how_ is it "not working"?
>>>>
>>>> Try building your parameters _up_ rather than starting with a lot, e.g.
>>>> select?defType=dismax&q=je suis avarié&qf=title
>>>> ^^ assumes you expect a match on title. Then:
>>>> select?defType=dismax&q=je suis avarié&qf=title subject
>>>>
>>>> etc.
>>>>
>>>> Because mm=757 looks really wrong. From the docs:
>>>> Defines the minimum number of clauses that must match, regardless of
>>>> how many clauses there are in total.
>>>>
>>>> edismax is used much more than dismax as it's more flexible, but
>>>> that's not germane here.
>>>>
>>>> Finally, try adding &debug=query to the URL to see exactly how the
>>>> query is parsed.
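
A small Python sketch of this build-it-up approach, using only the
standard library. The host and core name (localhost, mycore) are
placeholders, and build_select_url is a hypothetical helper; the
parameters are the ones mentioned in this thread.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, fields):
    """Build a dismax select URL with debug=query and proper URL encoding."""
    params = {
        "defType": "dismax",
        "q": query,
        "qf": " ".join(fields),
        "debug": "query",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Start with a single field, then add more once the first one matches.
url = build_select_url(
    "http://localhost:8983/solr/mycore",  # placeholder host and core
    "je suis avari\u00e9",
    ["title"],
)
print(url)
# http://localhost:8983/solr/mycore/select?defType=dismax&q=je+suis+avari%C3%A9&qf=title&debug=query
```

Letting urlencode handle the escaping also avoids problems with the
accented character and the spaces in the query string.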
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence <MSAUNIER@q1c1.fr> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> How can I ignore accents in the query results?
>>>>>
>>>>> Request:
>>>>> http://*****:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
>>>>>
>>>>> I want to get documents containing either avarié or avarie.
>>>>>
>>>>> I have added this to my schema:
>>>>>
>>>>>   {
>>>>>     "name": "string",
>>>>>     "positionIncrementGap": "100",
>>>>>     "analyzer": {
>>>>>       "filters": [
>>>>>         {
>>>>>           "class": "solr.LowerCaseFilterFactory"
>>>>>         },
>>>>>         {
>>>>>           "class": "solr.ASCIIFoldingFilterFactory"
>>>>>         },
>>>>>         {
>>>>>           "class": "solr.EdgeNGramFilterFactory",
>>>>>           "minGramSize": "3",
>>>>>           "maxGramSize": "50"
>>>>>         }
>>>>>       ],
>>>>>       "tokenizer": {
>>>>>         "class": "solr.KeywordTokenizerFactory"
>>>>>       }
>>>>>     },
>>>>>     "stored": true,
>>>>>     "indexed": true,
>>>>>     "sortMissingLast": true,
>>>>>     "class": "solr.TextField"
>>>>>   },
>>>>>
>>>>> But it is not working.
>>>>>
>>>>> Thanks.
>>>>
>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland
