lucene-solr-user mailing list archives

From Bill Au <bill.w...@gmail.com>
Subject Re: question about text field and WordDelimiterFilter in example schema.xml
Date Wed, 28 Oct 2009 02:31:24 GMT
I have been playing with this using the analysis.jsp.  I am still not clear
why we don't want to catenate at query time.  Here is my example.

With the current text field, the query term "iPhone" will not match a document
containing the string "iphone" because "iPhone" is analyzed into two terms:
i(1) and phone(2).  I am using a lower case filter.

If I set catenateWords to 1 at query time, "iPhone" is analyzed into:

term position 1: i
term position 2: phone iphone

So that will match a document containing the string "iphone".
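
To make the two analyses above concrete, here is a minimal Python sketch. It is only a toy model, not the actual WordDelimiterFilter: the function names are made up for illustration, it splits on lower-to-upper case changes only, and it places the catenated term at the position of the last part simply because that is what the analysis.jsp output above reports.

```python
import re

def split_on_case_change(token):
    """Split a token at each lowercase-to-uppercase boundary (toy model)."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()

def analyze(token, catenate_words=False):
    """Return (term, position) pairs after splitting and lowercasing.

    Assumption: the catenated term shares the position of the last part,
    mirroring the analysis.jsp output quoted in this message.
    """
    parts = split_on_case_change(token)
    terms = [(part.lower(), pos) for pos, part in enumerate(parts, start=1)]
    if catenate_words and len(parts) > 1:
        terms.append(("".join(parts).lower(), len(parts)))
    return terms

print(analyze("iPhone"))                       # [('i', 1), ('phone', 2)]
print(analyze("iPhone", catenate_words=True))  # [('i', 1), ('phone', 2), ('iphone', 2)]
```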

What bad things can happen if I split and catenate at query time?
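
For reference, the concern in Yonik's reply quoted below is that overlapping query tokens get read as "(foo or foobar) followed by bar". The following Python sketch is a rough model of that reading, not the Lucene query parser itself; the helper names are hypothetical and the token positions are taken from his foo-bar example.

```python
from collections import defaultdict

def position_slots(tokens):
    """Group (term, position) pairs into one list of alternatives per position."""
    by_pos = defaultdict(list)
    for term, pos in tokens:
        by_pos[pos].append(term)
    return [by_pos[pos] for pos in sorted(by_pos)]

def describe(tokens):
    """Render the position-by-position reading as text."""
    parts = []
    for terms in position_slots(tokens):
        parts.append(terms[0] if len(terms) == 1 else "(" + " or ".join(terms) + ")")
    return " followed by ".join(parts)

def phrase_matches(tokens, doc_terms):
    """True if doc_terms contains one alternative from each slot, consecutively."""
    slots = position_slots(tokens)
    n = len(slots)
    return any(
        all(doc_terms[start + k] in slots[k] for k in range(n))
        for start in range(len(doc_terms) - n + 1)
    )

# Query-time tokens for "foo-bar" as given in the quoted reply.
query_tokens = [("foo", 1), ("foobar", 1), ("bar", 2)]

print(describe(query_tokens))                        # (foo or foobar) followed by bar
print(phrase_matches(query_tokens, ["foo", "bar"]))  # True
print(phrase_matches(query_tokens, ["foobar"]))      # False
```

Under this reading, a document containing only the single term "foobar" is not matched, because the second position still requires "bar".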

Bill

On Tue, Oct 20, 2009 at 8:09 PM, Yonik Seeley <yonik@lucidimagination.com> wrote:

> On Tue, Oct 20, 2009 at 6:37 PM, Bill Au <bill.w.au@gmail.com> wrote:
> > I have a question regarding the use of the WordDelimiterFilter in the text
> > field in the example schema.xml.  The parameters are set differently for the
> > indexing and querying.  Namely, catenateWords and catenateNumbers are set
> > differently.  Shouldn't the same analysis be done at both index and query
> > time?
>
> That wouldn't work... if you tried to split and catenate at query time then
> foo-bar would generate the tokens "foo/foobar,bar"  (foo and foobar
> tokens overlapping).
> The Lucene query parser considers this to mean "(foo or foobar)
> followed by bar", which is clearly not good.
>
> It's essentially the same problem that keeps us from using synonym
> expansion at query time with synonyms greater than length 1.
>
> -Yonik
> http://www.lucidimagination.com
>
> > Bill
> >
> >    <!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
> >        words on case-change, alpha numeric boundaries, and non-alphanumeric chars,
> >        so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
> >        Synonyms and stopwords are customized by external files, and stemming is enabled.
> >        -->
> >    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <!-- in this example, we will only use synonyms at query time
> >        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >        -->
> >        <!-- Case insensitive stop word removal.
> >          add enablePositionIncrements=true in both the index and query
> >          analyzers to leave a 'gap' for more accurate phrase queries.
> >        -->
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
> >      </analyzer>
> >    </fieldType>
> >
>
