lucene-solr-user mailing list archives

From snkar <soumya....@zoho.com>
Subject Re: Stemming query in Solr
Date Mon, 01 Jul 2013 09:15:59 GMT
Hi Erick,

Thanks for the reply.

Here is what the situation is:

Relevant portion of Solr Schema:

<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

When I index a document, the content is stored as is in the Content field and copied
over to ContentSearch and ContentSearchStemming for text-based search and stemming
search respectively. So the ContentSearchStemming field does hold the stemmed (reduced)
form of the terms. I have checked this with Luke as well as with the Admin Schema
Browser --> Term Info. On the Admin Analysis screen, I have tested and found that if I
index the text "burning", it gets reduced to and indexed as "burn". So far so good.
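
As a rough illustration of stemming by reduction at analysis time, here is a minimal Python sketch. It assumes a toy suffix stripper (`toy_stem`, a hypothetical helper), which is not the Snowball algorithm behind solr.SnowballPorterFilterFactory; it only mirrors the shape of the text_stem chain (whitespace tokenizer followed by a stemming filter).

```python
# Toy suffix stripper. This is NOT the Snowball/Porter algorithm used by
# solr.SnowballPorterFilterFactory; it only illustrates "stemming by reduction".
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Whitespace-tokenize, then stem each token (same shape as text_stem)."""
    return [toy_stem(token) for token in text.split()]


if __name__ == "__main__":
    # Every variant collapses to the same indexed term.
    print(analyze("burning burned burns burn"))
```

With this toy chain, all four inputs come out as the single term "burn", which matches what the Analysis screen reports for "burning".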

Now in the UI:

 - Let's say the user enters the term "burn" and checks the stemming option. The
   expectation is that, since the user has specified stemming, results should be
   returned for the term "burn" as well as for all terms that have "burn" as their
   stem, i.e. burning, burned, burns, etc.
 - Let's say the user enters the term "burning" and checks the stemming option. The
   expectation is that, since the user has specified stemming, results should be
   returned for the term "burning" as well as for all terms that have "burn" as their
   stem, i.e. burn, burned, burns, etc.

The query that gets submitted to Solr: q=ContentSearchStemming:burning

From Debug Info:
<str name="rawquerystring">ContentSearchStemming:burning</str>
<str name="querystring">ContentSearchStemming:burning</str>
<str name="parsedquery">ContentSearchStemming:burn</str>
<str name="parsedquery_toString">ContentSearchStemming:burn</str>

So, when the results are returned, I am only getting the hits highlighted with the term
"burn", though the same document contains terms like burning and burns.

I thought that stemming should work like this:

 1. The stemming filter in the query analyzer chain would reduce the input word to its
    stem: burning --> burn
 2. The query component should scan through the terms and match those for which the
    stem of the term equals the stem of the input term: burns --> burn (matches)
    burning --> burn

The first point is happening. But it looks like the search is being executed as an exact
text-based match against the stem "burn". Hence, burns or burned are not getting returned.
I hope I was able to make myself clear.
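
To make the two steps above concrete, here is a toy model of how reduction behaves when it is applied on both the index side and the query side, which is what the quoted reply below describes. It is a sketch under stated assumptions: `toy_stem`, `build_index` and `search` are hypothetical helpers, the stemmer is a crude suffix stripper rather than Snowball, and the sample documents are made up.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")


def toy_stem(token: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (not Snowball)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index time: store each document id under the STEM of every token."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: dict[str, set[int]], term: str) -> set[int]:
    """Query time: stem the input, then do an exact lookup on that stem."""
    return index.get(toy_stem(term), set())


# Made-up sample documents.
docs = {1: "the fire burns", 2: "a burning log", 3: "burned toast", 4: "cold water"}
index = build_index(docs)

if __name__ == "__main__":
    print(sorted(search(index, "burn")))
    print(sorted(search(index, "burning")))
```

In this model the lookup is an exact match on "burn", yet documents 1, 2 and 3 are all found for either input, because "burns", "burning" and "burned" were each filed under "burn" at index time. No scan over the original terms is needed at query time.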

---- On Fri, 28 Jun 2013 05:59:37 -0700 Erick Erickson [via Lucene] <ml-node+s472066n4073901h29@n3.nabble.com> wrote ----


 First, this is for the Java version, I hope it extends to C#. 

But in your configuration, when you're indexing the stemmer 
should be storing the reduced form in the index. Then, when 
searching, the search should be against the reduced term. 
To check this, try 
1> Using the Admin/Analysis page to see what gets stored 
     in your index and what your query is transformed to, to 
     ensure that you're getting what you expect. 

If you want to get deeper into the details, try 
1> use, say, the TermsComponent or Admin/Schema Browser 
     or Luke to look in your index and see what's actually 
     there. 
2> use &debug=query or Admin/Analysis to see what the query 
     actually looks like. 

Both your use-cases should work fine just with reduction 
_unless_ the particular word you look for doesn't happen to 
trip the stemmer. By that I mean that since it's algorithmically 
based, there may be some edge cases that seem like they 
should be reduced that aren't. I don't know whether "fisherman" 
would reduce to "fish" for instance. 

So are you seeing things that really don't work as expected or 
are you just working from the docs? Because I really don't 
see why you wouldn't get what you want given your description. 

Best 
Erick 


On Fri, Jun 28, 2013 at 2:33 AM, snkar <[hidden email]> wrote: 

> We have a search system based on Solr using the Solrnet library in C# which 
> supports some advanced search features like Fuzzy, Synonym and Stemming. 
> While all of these work, *the expectation from the Stemming Search seems to 
> be a combination of Stemming by reduction as well as stemming by expansion 
> to cover grammatical variations on a word*. A use case will make it more 
> clear: 
> 
>  - a search for fish would also find fishing 
>  - a search for applied would also find applying, applies, and apply 
> 
> We had implemented Stemming using a CopyField with 
> SnowballPorterFilterFactory. *As a result, when /searching for burning the 
> results are returning for burning and burn/ but when /searching for burn the 
> results are not returning for burning or burnt or burns/* 
> 
> Since all stemmers supported by Lucene/Solr use stemming by reduction, we 
> are not sure how to go about this. As per the Solr Wiki: 
> 
> > A related technology to stemming is lemmatization, which allows for 
> > "stemming" by expansion, taking a root word and 'expanding' it to all of 
> > its various forms. Lemmatization can be used either at insertion time or 
> > at query time. Lucene/Solr does not have built-in support for 
> > lemmatization but it can be simulated by using your own dictionaries and 
> > the SynonymFilterFactory 
> 
> We are not sure of exactly how to go about this in Solr. Any ideas? 
> 
> We were also thinking in terms of using some C# based stemmer/lemmatizer 
> library to get the root of the word and using some public database like 
> WordNet to extract the different grammatical variations of the stem and then 
> send across all these terms for querying in Solr. We have not yet done a lot 
> of research to figure out a stable C# stemmer/lemmatizer and a WordNet C# 
> API, but it seems like this will get too convoluted and there should be a way 
> to execute it from within Solr. 
> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Stemming-query-in-Solr-tp4073862.html
> Sent from the Solr - User mailing list archive at Nabble.com. 
> 
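
The client-side expansion idea in the quoted message (look up the grammatical variants of a word, then send all of them to Solr) could be sketched roughly as below. Everything here is an assumption for illustration: the `VARIANTS` table is hand-written and stands in for whatever a lemmatizer or WordNet lookup would return, `expand_query` is a hypothetical helper, and the query targets the unstemmed ContentSearch field from the schema above.

```python
# Hypothetical, hand-written variant table; a real system would derive this
# from a lemmatizer or a lexical database rather than hard-coding it.
VARIANTS = {
    "burn": ["burn", "burning", "burned", "burnt", "burns"],
}


def expand_query(field: str, root: str, variants: dict[str, list[str]] = VARIANTS) -> str:
    """Build a Lucene-syntax OR query over all known forms of `root`."""
    forms = variants.get(root, [root])  # fall back to the word itself
    return f"{field}:(" + " OR ".join(forms) + ")"


if __name__ == "__main__":
    print(expand_query("ContentSearch", "burn"))
```

The result is a single query string such as `ContentSearch:(burn OR burning OR burned OR burnt OR burns)`, which matches the surface forms directly instead of relying on a stemmed field.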

 
 
--
View this message in context: http://lucene.472066.n3.nabble.com/Stemming-query-in-Solr-tp4073862p4074283.html
Sent from the Solr - User mailing list archive at Nabble.com.