lucene-solr-user mailing list archives

From Steven White <swhite4...@gmail.com>
Subject Re: Default stop word list
Date Fri, 26 Aug 2016 13:13:42 GMT
But what about the current "default" list that comes with Solr?  How was
that list, for all supported languages, determined?

What I fear is this: when someone puts Solr into production and no one
changes that list, then if the list is not "valid" it will impact search.
And if the list is valid, how was it determined: just by the Solr / Lucene
development team, or with input from linguistic experts?

Steve

On Fri, Aug 26, 2016 at 2:25 AM, Srinivasa Meenavalli <Smeenavali@zensar.com
> wrote:

> Hi Steven,
>
> The list of stop words for a language is not fixed; there is no single
> universal list of stop words used by all natural language processing tools.
> Ideally, stop words should be defined by search merchandisers based on
> their domain instead of relying on the default.
>
> https://en.wikipedia.org/wiki/Stop_words
>
> You can add your own lang/stopwords_<languagecode>.txt and reference it
> from the field type:
>
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>     <!-- index-time analysis chain -->
>     <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>     <!-- query-time analysis chain: same filters, plus synonym expansion -->
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.SynonymFilterFactory" expand="true" synonyms="synonyms.txt" ignoreCase="true"/>
>       <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
> </fieldType>
>
> Regards
> Srinivas Meenavalli
>
> -----Original Message-----
> From: Steven White [mailto:swhite4141@gmail.com]
> Sent: Friday, August 26, 2016 4:02 AM
> To: solr-user@lucene.apache.org
> Subject: Default stopword list
>
> Hi everyone,
>
> I'm curious: how was the current "default" stopword list, for English and
> other languages, determined?  And for English, why is "I" not in the
> stopword list?
>
> Thanks in advance.
>
> Steve
>
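
For anyone who wants to follow the advice in the quoted reply and maintain a domain-specific list rather than rely on the default, a minimal sketch of such a field type might look like the following. The field type name text_en_custom and the file lang/stopwords_custom.txt are hypothetical names chosen for illustration; the StopFilterFactory attributes (words, ignoreCase) are the same ones used in the quoted config.

```xml
<!-- Sketch only: text_en_custom and lang/stopwords_custom.txt are hypothetical names. -->
<fieldType name="text_en_custom" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used for both indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- The words file holds one stop word per line; lines starting with # are comments. -->
    <filter class="solr.StopFilterFactory" words="lang/stopwords_custom.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Sharing one analyzer between index and query time keeps the stop list consistent on both sides. After editing the words file, documents that were already indexed generally need to be reindexed before the change is fully reflected in search results.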
