lucene-solr-user mailing list archives

From Shawn Heisey <s...@elyograg.org>
Subject Re: StopWords coming in Top 10 terms despite using StopFilterFactory
Date Thu, 22 Sep 2011 18:34:56 GMT
On 9/22/2011 3:54 AM, Pranav Prakash wrote:
> Hi List,
>
> I included StopFilterFactory and I can see it taking action in the Analyzer
> Interface. However, when I go to Schema Analyzer, I see those stop words in
> the top 10 terms. Is this normal?
>
> <fieldType name="text_commongrams" class="solr.TextField">
> <analyzer>
> <charFilter class="solr.HTMLStripCharFilterFactory"/>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> <filter class="solr.TrimFilterFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
> ignoreCase="true"/>
> <filter class="solr.StopFilterFactory" words="stopwords.txt"
> ignoreCase="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" preserveOriginal="1"/>
> </analyzer>
> </fieldType>


You've got CommonGramsFilterFactory and StopFilterFactory both using 
stopwords.txt, which is a confusing configuration.  Normally you'd want 
one or the other, not both ... but if you did legitimately have both, 
you'd want them to each use a different wordlist.
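
As a rough sketch of the "both, with different wordlists" option: the
fragment below assumes a second file named commongrams.txt, which is a
hypothetical name and not something from your config.  The chain is
trimmed to the relevant filters for brevity, so treat it as an
illustration rather than a drop-in replacement.

```xml
<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- commongrams.txt is a hypothetical file: words to combine with their neighbors -->
    <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt"
            ignoreCase="true"/>
    <!-- stopwords.txt: words to drop outright -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```

If you only want one of the two behaviors, the simpler route is to
drop whichever filter you don't need and keep a single wordlist.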

The commongrams filter pairs each occurrence of a word in the file with 
its neighbors, producing combined tokens - one joined to the token 
before it, one joined to the token after it.  If it's the first or last 
term in a field, only one combined token is produced.  When these get 
to the stopfilter, the combined terms no longer match what's in 
stopwords.txt, so no action is taken on them.

If I had to guess, what you are seeing in the top 10 terms is the 
concatenation of your most common stopword with another word.  If it 
were English, I would guess that to be "of_the" or something similar.  
If my guess is wrong, then I'm not sure what's going on, and some 
cut/paste of what you're actually seeing might be in order.  Did you do 
a delete and a full reindex after you changed your schema?

Thanks,
Shawn

