lucene-solr-user mailing list archives

From Diego Fernandez <difer...@redhat.com>
Subject Re: WordDelimiter filter, expanding to multiple words, unexpected results
Date Tue, 02 Sep 2014 22:01:06 GMT
Although not a solution, this may help in trying to find the problem.
In http://solr.pl/en/2010/08/16/what-is-schema-xml/ it says:

"It is worth noting that there is an additional attribute for the text field type:

    autoGeneratePhraseQueries

This attribute is responsible for telling filters how to behave when dividing tokens. Some
filters (such as WordDelimiterFilter) can divide tokens into a set of tokens. Setting the
attribute to true (default value) will automatically generate phrase queries. This means that
WordDelimiterFilter will divide the word “wi-fi” into two tokens “wi” and “fi”.
With autoGeneratePhraseQueries set to true query sent to Lucene will look like "field:wi fi",
while with set to false Lucene query will look like field:wi OR field:fi. However, please
note, that this attribute only behaves well with tokenizers based on white spaces."

Since phrase queries match on token positions, it is possible that the positions assigned to
the other generated tokens have something to do with it. Have you tried setting autoGeneratePhraseQueries="false"
to see if it'll match both? (I know that might have other unintended effects, but it might
give some insight into the problem.)
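
For that test, here is a minimal sketch of the change, assuming the field type from the quoted message below. Only the autoGeneratePhraseQueries value differs; the analyzer chain is copied as-is. As far as I know the attribute only affects how the query is built, so treat this as a diagnostic setting rather than a fix:

```xml
<!-- Diagnostic only: same field type as in the quoted message, with phrase generation turned off -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

If "MacBook" then matches both "macbook" and "mac book", that would point at the token positions in the generated phrase query rather than at the tokens themselves.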

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



----- Original Message -----
> On 9/2/14 1:51 PM, Erick Erickson wrote:
> > bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
> > not "macbook"
> >
> > I suspect the query-time parameters for WordDelimiterFilterFactory don't have
> > catenateWords set.
> >
> > What do you see when you enter these in both the index and query portions
> > of the admin/analysis page?
> 
> Thanks Erick!
> 
> Our WordDelimiterFilterFactory does have catenate words set, in both
> index and query phases (is that right?):
> 
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> 
> It's hard to cut and paste the results of the analysis page into email
> (or anywhere!), so I'll give you screenshots instead, sorry -- and I'll give
> them for our whole, complex real-world field definition. I'll also paste
> in our entire field definition below. But I realize my next step is
> probably creating a simpler isolation/reproduction case (unless you have
> a magic answer from this!).
> 
> Again, the problem is that "MacBook" seems to be only matching on
> indexed "macbook" and not indexed "mac book".
> 
> 
> "MacBook" query analysis:
> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
> 
> "MacBook" index analysis:
> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
> 
> "mac book" index analysis:
> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
> 
> 
> Our entire actual field definition:
> 
>    <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>        <analyzer>
>         <!-- the rulefiles thing is to keep ICUTokenizerFactory from
> stripping punctuation,
>              so our synonym filter involving C++ etc can still work.
>              From:
> https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070409@elyograg.org%3E
>              the rbbi file is in our local ./conf, copied from lucene
> source tree -->
>         <tokenizer class="solr.ICUTokenizerFactory"
> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
> 
>         <filter class="solr.SynonymFilterFactory"
> synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
> 
>          <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
> 
>          <!-- folding needs to be after WordDelimiter, so WordDelimiter
>               can do its thing with full cases and such -->
>          <filter class="solr.ICUFoldingFilterFactory" />
> 
> 
>          <!-- ICUFolding already includes lowercasing, no
>               need for separate lowercasing step
>          <filter class="solr.LowerCaseFilterFactory"/>
>          -->
> 
>          <filter class="solr.SnowballPorterFilterFactory"
> language="English" protected="protwords.txt"/>
>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        </analyzer>
>      </fieldType>
> 
> 
> 
> 
> 
