lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Default stop word list
Date Tue, 30 Aug 2016 01:18:41 GMT
Do not remove stop words. Want to search for “vitamin a”? That won’t work.

Stop word removal is a hack left over from when we were running search engines in 64 kbytes
of memory.

Yes, common words are less important for search, but removing them is a brute force approach
with severe side effects. Instead, we use a proportional approach with the tf.idf model. That
puts a higher weight on rare words and a lower weight on common words.
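
A minimal sketch of that proportional weighting, assuming a toy four-document corpus and one common smoothed idf variant. This is not Lucene's or Solr's actual scoring formula; the corpus and the `idf` and `score` helpers are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, lowercased and split on whitespace.
docs = [
    "vitamin a supplement",
    "vitamin c tablets",
    "a few dollars more",
    "a fistful of dollars",
]
tokenized = [d.split() for d in docs]

# Document frequency: how many documents contain each term.
df = Counter(term for doc in tokenized for term in set(doc))
n_docs = len(tokenized)

def idf(term):
    # Smoothed inverse document frequency (one variant among several).
    return math.log((n_docs + 1) / (df[term] + 1)) + 1.0

def score(query, doc_tokens):
    # Sum of term frequency times idf over the query terms.
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf(t) for t in query.split())

scores = [score("vitamin a", doc) for doc in tokenized]
best = max(range(n_docs), key=lambda i: scores[i])
print(docs[best])
```

The common word "a" still contributes to the score, but less than "vitamin", so the document containing both terms ranks first and a document matching only "a" ranks below one matching only "vitamin".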

For some real-life examples of problems with stop words, you can read the list of movie titles
that disappear with stemming and stop words. I discovered these when I was running search
at Netflix.

	• Being There (this is the first one I noticed)
	• To Be and To Have (Être et Avoir)
	• To Have and Have Not
	• Once and Again
	• To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
	• To Be or Not To Be (1983)
	• Now and Then, Here and There
	• Be with Me
	• I’ll Be There
	• It Had to Be You
	• You Should Not Be Here
	• You Are Here

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
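
A rough sketch of how such titles vanish, assuming a small made-up stop word list and a one-entry lookup table standing in for a real stemmer. Neither matches any particular Solr configuration, and the `analyze` helper is hypothetical.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "are", "be", "for", "here", "not", "or", "there", "to", "you"}

# Crude stand-in for a stemmer: maps an inflected form to its stem.
STEMS = {"being": "be"}

def analyze(title):
    # Lowercase, tokenize, stem, then drop stop words.
    tokens = re.findall(r"[a-z']+", title.lower())
    stemmed = [STEMS.get(t, t) for t in tokens]
    return [t for t in stemmed if t not in STOP_WORDS]

titles = [
    "Being There",
    "To Be or Not To Be",
    "You Are Here",
    "For a Few Dollars More",
]

# Titles that leave no indexable tokens cannot be matched by any query.
vanished = [t for t in titles if not analyze(t)]
print(vanished)
```

With this list, the first three titles produce no tokens at all, while "For a Few Dollars More" keeps only "few", "dollars", and "more".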

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 29, 2016, at 5:39 PM, Steven White <swhite4141@gmail.com> wrote:
> 
> Thanks Shawn.  This is the best answer I have seen, much appreciated.
> 
> A follow-up question: I want to remove stop words from the list, but if I
> do, then search quality will degrade (and index size will grow, which is
> less of an issue).  For example, if I remove "a", then if someone searches
> for "For a Few Dollars More" (without quotes), chances are good that records
> with "a" that are not relevant to the user's search will land higher up.
> How can I address this?  Can I set up my schema so that records that get
> hits against a list of words, let's say off the stop word list, are ranked
> lower?
> 
> Steve
> 
> On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey <apache@elyograg.org> wrote:
> 
>> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
>>> I personally think that stopword removal is more of a problem than a
>>> solution.
>> 
>> There actually is one thing that a stopword filter can do that has little
>> to do with the purpose it was designed for.  You can make it impossible
>> to search for certain words.
>> 
>> Imagine that your original data contains the word "frisbee" but for some
>> reason you do not want anybody to be able to locate results using that
>> word.  You can create a stopword list containing just "frisbee" and any
>> other variations that you want to limit like "frisbees", then place it
>> as a filter on the index side of your analysis.  With this in place,
>> searching for those terms will retrieve zero results.
>> 
>> Thanks,
>> Shawn
>> 
>> 

