lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Stemming query in Solr
Date Tue, 02 Jul 2013 11:30:00 GMT
Somehow we're mis-communicating here. Forget expansion,
it's all about base forms. <G>.

bq: What I cannot figure out is how is this going to help me in instructing
Solr to execute the query for the different grammatical variations of the
input search term stem

You don't. You search the stemmed input against the stemmed
field (this happens automatically, because analysis is configured per field).

So you get hits on burn, burns, burned, and burning when searching
for "burning", because both the query and index processes are
working with "burn". Note that the _stored_ values that get returned with
the fields are all the originals, so you see burns, burning, etc.
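
A minimal sketch of that idea in Python, outside Solr. The toy_stem function is a hypothetical stand-in for a real stemmer (it is not the Porter algorithm), and the sample documents are invented for illustration. It only shows that when the same analysis runs at index time and at query time, a search for "burning" lands on every document whose terms reduce to "burn", while an unstemmed index matches only the literal term.

```python
from collections import defaultdict

def toy_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, analyze):
    """Map each analyzed term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[analyze(token)].add(doc_id)
    return index

def search(index, term, analyze):
    """Analyze the query term the same way the field was analyzed at index time."""
    return sorted(index.get(analyze(term.lower()), set()))

# Invented sample documents; the stored text is kept untouched.
docs = {
    1: "the fire burns",
    2: "burning wood",
    3: "he burned it",
    4: "a slow burn",
    5: "fresh bread",
}

stemmed_index = build_index(docs, toy_stem)          # like a stemmed field
exact_index = build_index(docs, lambda token: token)  # like an unstemmed field

stemmed_hits = search(stemmed_index, "burning", toy_stem)
exact_hits = search(exact_index, "burning", lambda token: token)

print([docs[i] for i in stemmed_hits])  # original text of every "burn" variant
print([docs[i] for i in exact_hits])    # only the literal "burning" document
```

Note that the stemmed index holds only the key "burn"; the original word forms survive only in the stored text.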

Your query searches against one or the other field depending
on whether you have the "exact match" checkbox checked or
not. You can even do a variant of searching on _both_ with
a high boost on the exact_match field, which would _tend_ to
sort the documents with exact match to the top of the list.
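
As a rough sketch of the two options, here is how the request parameters might be assembled in Python. The field names come from the schema quoted further down in this thread; the choice of the edismax parser with its qf parameter, the boost value of 10, and the build_params helper are assumptions for illustration, not something taken from the original setup.

```python
from urllib.parse import urlencode

def build_params(user_query, exact_only=False, exact_boost=10):
    """Build Solr request parameters (hypothetical helper).

    exact_only=True searches just the unstemmed field; otherwise both
    fields are searched and the unstemmed one gets a higher boost.
    """
    if exact_only:
        qf = "ContentSearch"
    else:
        # Boost value is an arbitrary assumption; tune it against real data.
        qf = f"ContentSearch^{exact_boost} ContentSearchStemming"
    return {
        "q": user_query,
        "defType": "edismax",  # assumed query parser
        "qf": qf,
        "fl": "Content",       # the stored field that holds the original text
    }

both_fields = build_params("burning")
exact_match = build_params("burning", exact_only=True)

# Query strings that could be appended to a /select request.
print(urlencode(both_fields))
print(urlencode(exact_match))
```

With both fields in qf, a document containing the literal term scores on both fields, so it would tend to rank above one that matches only through the stem.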

Best
Erick


On Mon, Jul 1, 2013 at 9:12 AM, snkar <soumya.kar@zoho.com> wrote:

> I was just wondering if another solution might work. If we are able to
> extract the stem of the input search term (maybe using a C# based stemmer,
> some open source implementation of the Porter algorithm) for cases where
> the stemming option is selected, and submit the query to Solr as a
> multiple-character wildcard query on the stem, it should return all
> the different variations of the stemmed word.
>
> Example:
>
> Search Term: burning
> Stem: burn
> Modified Query: burn*
> Results: burn, burning, burns, burnt, etc.
>
> I am sure this is not the proper way of executing stemming by expansion,
> but this might just get the job done. What do you think? I am trying to
> think of a test case where this will fail.
>
> ---- On Mon, 01 Jul 2013 03:42:34 -0700 Erick Erickson [via Lucene]
> <ml-node+s472066n4074311h39@n3.nabble.com> wrote ----
>
>
>  bq:  But looks like it is executing the search for an exact text based
> match with the stem "burn".
>
> Right. You need to appreciate index time as opposed to query time stemming.
> Your field definition has both turned on. The admin/analysis page will help
> here <G>..
>
> At index time, the terms are stemmed, and _only_ the reduced term is put in
> the index.
> At query time, the same thing happens and _only_ the reduced term is
> searched for.
>
> By stemming at index time, you lose the original form of the word, it's
> just gone and nothing about checking/unchecking the "stem" bits will recover
> it. So the general solution is to index the field twice, once with stemming
> and once without in order to have the ability to do both stemmed and exact
> matches. I think I saw a clever approach to doing this involving a custom
> filter but can't find it now. As I recall it indexed the un-stemmed version
> like a synonym with some kind of marker to indicate exact match when
> necessary....
>
> Best
> Erick
>
>
> On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote:
>
> > Hi Erick,
> >
> > Thanks for the reply.
> >
> > Here is what the situation is:
> >
> > Relevant portion of Solr Schema:
> > <field name="Content" type="text_general" indexed="false" stored="true"
> >        required="true"/>
> > <field name="ContentSearch" type="text_general" indexed="true"
> >        stored="false" multiValued="true"/>
> > <field name="ContentSearchStemming" type="text_stem" indexed="true"
> >        stored="false" multiValued="true"/>
> > <copyField source="Content" dest="ContentSearch"/>
> > <copyField source="Content" dest="ContentSearchStemming"/>
> >
> > <fieldType name="text_general" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > <fieldType name="text_stem" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > When I am indexing a document, the content gets stored as is in the
> > Content field and gets copied over to ContentSearch and
> > ContentSearchStemming for text based search and stemming search
> > respectively. So, the ContentSearchStemming field does store the
> > stem/reduced form of the terms. I have checked this with Luke as well as
> > the Admin Schema Browser --> Term Info. In the Admin Analysis screen, I
> > have tested and found that if I index the text "burning", it gets reduced
> > to and stored as "burn". So far so good.
> >
> > Now in the UI:
> >  - Let's say the user puts in the term "burn" and checks the stemming
> >    option. The expectation is that since the user has specified stemming,
> >    the results should be returned for the term "burn" as well as for all
> >    terms which have "burn" as their stem, i.e. burning, burned, burns, etc.
> >  - Let's say the user puts in the term "burning" and checks the stemming
> >    option. The expectation is that since the user has specified stemming,
> >    the results should be returned for the term "burning" as well as for
> >    all terms which have "burn" as their stem, i.e. burn, burned, burns, etc.
> >
> > The query that gets submitted to Solr: q=ContentSearchStemming:burning
> > From Debug Info:
> > <str name="rawquerystring">ContentSearchStemming:burning</str>
> > <str name="querystring">ContentSearchStemming:burning</str>
> > <str name="parsedquery">ContentSearchStemming:burn</str>
> > <str name="parsedquery_toString">ContentSearchStemming:burn</str>
> >
> > So, when the results are returned, I am only getting the hits highlighted
> > with the term "burn", though the same document contains terms like burning
> > and burns.
> >
> > I thought that the stemming should work like this:
> >  1. The stemming filter in the query analyzer chain would reduce the input
> >     word to its stem. burning --> burn
> >  2. The query component should scan through the terms and match those
> >     terms for which it finds a match between the stem of the term and the
> >     stem of the input term. burns --> burn (matches), burning --> burn
> >
> > The first point is happening. But it looks like it is executing the search
> > for an exact text based match with the stem "burn". Hence, burns or burned
> > are not getting returned.
> > Hope I was able to make myself clear.
> >
> > ---- On Fri, 28 Jun 2013 05:59:37 -0700 Erick Erickson [via Lucene]
> > <[hidden email]> wrote ----
> >
> >
> > First, this is for the Java version, I hope it extends to C#.
> >
> > But in your configuration, when you're indexing the stemmer
> > should be storing the reduced form in the index. Then, when
> > searching, the search should be against the reduced term.
> > To check this, try
> > 1> Using the Admin/Analysis page to see what gets stored
> >    in your index and what your query is transformed to, to
> >    ensure that you're getting what you expect.
> >
> > If you want to get in deeper to the details, try
> > 1> use, say, the TermsComponent or Admin/Schema Browser
> >    or Luke to look in your index and see what's actually
> >    there.
> > 2> use &debug=query or Admin/Analysis to see what the query
> >    actually looks like.
> >
> > Both your use-cases should work fine just with reduction
> > _unless_ the particular word you look for doesn't happen to
> > trip the stemmer. By that I mean that since it's algorithmically
> > based, there may be some edge cases that seem like they
> > should be reduced that aren't. I don't know whether "fisherman"
> > would reduce to "fish" for instance.
> >
> > So are you seeing things that really don't work as expected or
> > are you just working from the docs? Because I really don't
> > see why you wouldn't get what you want given your description.
> >
> > Best
> > Erick
> >
> &gt;
> > On Fri, Jun 28, 2013 at 2:33 AM, snkar <[hidden email]> wrote:
> >
> > > We have a search system based on Solr using the Solrnet library in C#
> > > which supports some advanced search features like Fuzzy, Synonym and
> > > Stemming. While all of these work, *the expectation from the Stemming
> > > Search seems to be a combination of Stemming by reduction as well as
> > > stemming by expansion to cover grammatical variations on a word*. A use
> > > case will make it more clear:
> > >
> > >  - a search for fish would also find fishing
> > >  - a search for applied would also find applying, applies, and apply
> > >
> > > We had implemented Stemming using a CopyField with
> > > SnowballPorterFilterFactory. *As a result, when /searching for burning
> > > the results are returning for burning and burn/ but when /searching for
> > > burn the results are not returning for burning or burnt or burns/*
> > >
> > > Since all stemmers supported by Lucene/Solr use stemming by reduction,
> > > we are not sure how to go about this. As per the Solr Wiki:
> > >
> > > > A related technology to stemming is lemmatization, which allows for
> > > > "stemming" by expansion, taking a root word and 'expanding' it to all
> > > > of its various forms. Lemmatization can be used either at insertion
> > > > time or at query time. Lucene/Solr does not have built-in support for
> > > > lemmatization but it can be simulated by using your own dictionaries
> > > > and the SynonymFilterFactory
> > >
> > > We are not sure of exactly how to go about this in Solr. Any ideas?
> > >
> > > We were also thinking in terms of using some C# based stemmer/lemmatizer
> > > library to get the root of the word and using some public database like
> > > WordNet to extract the different grammatical variations of the stem and
> > > then send across all these terms for querying in Solr. We have not yet
> > > done a lot of research to figure out a stable C# stemmer/lemmatizer and
> > > a WordNet C# API, but it seems like this will get too convoluted and
> > > there should be a way to execute it from within Solr.
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Stemming-query-in-Solr-tp4073862.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
