lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Arabic words search in solr
Date Wed, 02 Aug 2017 15:42:32 GMT
+1

I was hoping to use this as a case for arguing for turning off an overly aggressive stemmer,
but I checked your 10 docs and query, and David is right, of course -- if you change the
default operator to AND, you get back only the one document you intended.
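
For anyone who wants to try the operator change per request, here is a minimal Python sketch that builds the same query with `q.op=AND`. It assumes the standard query parser; the host, port, and core name (`mycore`) are placeholders that do not come from this thread, and `build_params` is just an illustrative helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint: host, port, and core name are assumptions, not from the thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_params(first_word, second_word, operator="AND"):
    """Return select parameters that require both words when operator is AND."""
    return {
        "q": f"bizNameAr:({first_word} {second_word})",
        "q.op": operator,  # per-request default operator
        "wt": "json",
        "indent": "true",
    }


# The two query words from the thread, given as escapes to keep code points explicit.
params = build_params("\u0634\u0631\u0637\u0629", "\u0627\u0632\u0643\u064a")
url = f"{SOLR_SELECT_URL}?{urlencode(params)}"
print(url)
```

Passing `operator="OR"` reproduces the original behaviour, where a document matching either word is returned.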

I can still use this as a case for getting on my Unicode normalization soapbox and +1'ing
your use of the ICUFoldingFilter.  With no token filters, you get 4 results; when you add
the ICUFoldingFilter, you get 8 results; and when you add in the Arabic stemmer, you get all
10.  Not that you need this, but see slide 33 of [1], where we show 78 Unicode variants for
"America" in ~800k docs in an Arabic-script language.  Without Unicode normalization, users
might get half the documents back, or far, far fewer... and they wouldn't even know what they
were missing!
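
To make the folding point concrete, here is a small Python sketch of why the typed query word and the indexed word do not match without normalization. The `fold` helper is a rough stand-in that I am assuming for illustration (decompose, then drop combining marks); it is not the ICUFoldingFilter, which does considerably more.

```python
import unicodedata


def fold(text):
    """Decompose the text, then drop combining marks (a crude folding stand-in)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The query word, typed with a bare alef (U+0627).
typed = "\u0627\u0632\u0643\u064a"
# The same word as it appears in several of the docs, with alef-hamza-below (U+0625).
indexed = "\u0625\u0632\u0643\u064a"

print(typed == indexed)              # False: the raw code points differ
print(fold(typed) == fold(indexed))  # True: both fold to the bare-alef form
```

Documents that spell the word with the hamza only become reachable once some folding of this kind is applied at both index and query time.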

[1] https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf

-----Original Message-----
From: David Hastings [mailto:hastings.recursive@gmail.com] 
Sent: Wednesday, August 2, 2017 9:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Arabic words search in solr

Perhaps change your default operator to AND instead of OR, if that's what you are expecting
for a result.

On Wed, Aug 2, 2017 at 8:57 AM, mohanmca01 <mohanmca01@gmail.com> wrote:

> Hi Phil Scadden,
>
>  Thank you for your reply,
>
> We tried your suggested solution of removing the hyphen while indexing,
> but we are still getting wrong results. I searched for "شرطة ازكي" and
> got the result that I am looking for, plus irrelevant results that
> contain only one of the two words that I typed.
>
> First word: شرطة
> Second Word: ازكي
>
> results that we are getting:
>
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 3,
>     "params": {
>       "indent": "true",
>       "q": "bizNameAr:(شرطة ازكي)",
>       "_": "1501678260335",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 444,
>     "start": 0,
>     "docs": [
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>       {
>         "id": "13937",
>         "bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
>         "_version_": 1574621132197200000
>       },
>       {
>         "id": "15914",
>         "bizNameAr": "العلوي والازكي المتحدة ش.م.م",
>         "_version_": 1574621132344000500
>       },
>       {
>         "id": "20639",
>         "bizNameAr": "سحائب ازكي للتجارة",
>         "_version_": 1574621132574687200
>       },
>       {
>         "id": "25108",
>         "bizNameAr": "المستشفيات -  - مستشفى إزكي",
>         "_version_": 1574621132737216500
>       },
>       {
>         "id": "27629",
>         "bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
>         "_version_": 1574621132833685500
>       },
>       {
>         "id": "36351",
>         "bizNameAr": "طوارئ الكهرباء - إزكي",
>         "_version_": 1574621133183910000
>       },
>       {
>         "id": "61235",
>         "bizNameAr": "اضواء ازكي للتجارة",
>         "_version_": 1574621133785792500
>       },
>       {
>         "id": "66821",
>         "bizNameAr": "أطلال إزكي للتجارة",
>         "_version_": 1574621133915816000
>       },
>       {
>         "id": "67011",
>         "bizNameAr": "بنك ظفار - فرع ازكي",
>         "_version_": 1574621133920010200
>       }
>     ]
>   }
> }
>
> Actually, we expect only the result below, since it contains both of the
> words that we typed while searching:
>
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>
>
> Configuration:
>
> In schema.xml we configured the field and its type as below:
>
>     <field name="bizNameAr" type="text_ar" indexed="true" stored="true"/>
>
>
>     <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
>         <filter class="solr.ArabicNormalizationFilterFactory"/>
>         <filter class="solr.ArabicStemFilterFactory"/>
>         <filter class="solr.ICUFoldingFilterFactory"/>
>         <filter class="solr.HyphenatedWordsFilterFactory"/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ى" replacement="ئ"/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ء" replacement=""/>
>       </analyzer>
>     </fieldType>
>
>
> Thanks,
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>