lucene-solr-user mailing list archives

From Jamie Johnson <jej2...@gmail.com>
Subject Re: Including phonetic search in text field
Date Mon, 23 May 2011 20:18:06 GMT
Ah yes, that is very helpful, thanks Paul.  I knew there would be something
that I broke :).  I will need to go back over the use cases and see which
ones do and do not require exact matches.  Thanks again!

I have never heard of DisMax, so this is new to me as well, but I have found
some posts about it.  I am sure this will generate other questions :)  Again,
thanks.

On Mon, May 23, 2011 at 3:56 PM, Paul Libbrecht <paul@hoplahup.net> wrote:

> Jamie,
>
> The problem with that is that you can no longer do exact matching.
> For this reason, it is good practice to have two fields (one exact, one
> phonetic), to use a query expander such as dismax (preferring exact matches
> and weighting phonetic matches lower), and to rely on that only when you
> sort by score.
>
> hope it helps
>
> paul
>
>
> On 23 May 2011 at 21:43, Jamie Johnson wrote:
>
> > I am new to Solr and am trying to determine the best way to take the text
> > field type (the one in the example schema) and add phonetic search to it.
> > Currently I have done the following:
> >
> >    <fieldType name="text" class="solr.TextField"
> >               positionIncrementGap="100"
> >               autoGeneratePhraseQueries="true">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.DoubleMetaphoneFilterFactory"/>
> >        <!-- in this example, we will only use synonyms at query time
> >        <filter class="solr.SynonymFilterFactory"
> >                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >        -->
> >        <!-- Case insensitive stop word removal.
> >          add enablePositionIncrements=true in both the index and query
> >          analyzers to leave a 'gap' for more accurate phrase queries.
> >        -->
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.KeywordMarkerFilterFactory"
> >                protected="protwords.txt"/>
> >        <filter class="solr.PorterStemFilterFactory"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.DoubleMetaphoneFilterFactory"/>
> >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >                ignoreCase="true" expand="true"/>
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.KeywordMarkerFilterFactory"
> >                protected="protwords.txt"/>
> >        <filter class="solr.PorterStemFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
> >
> > which seems to work.  Is this appropriate, or is there a better way of
> > doing this?  I had previously defined a custom phonetic field, but that
> > would mean that for each field which I wanted to support a phonetic style
> > search I would need to add an additional field.  Adding it to the text
> > type seemed much more elegant since it would work for all text fields.
> > Is there a reason not to do this (e.g. performance, index size, etc.)?
> > Any insight/guidance would be greatly appreciated.
> >
> > Also, if anyone could point me to docs on what exactly filters do, I
> > would appreciate it.  My assumption is that they inject additional tokens
> > based on the specific filter class.  Am I correct?
>
>
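
For readers following the thread, the two-field layout Paul recommends could
look roughly like the schema.xml fragment below.  This is only a sketch under
assumptions: the names text_phonetic, title and title_phonetic are
hypothetical, and the "text" type is assumed to be the stock example type
without the DoubleMetaphone filter added to it.

```xml
<!-- Sketch only: text_phonetic, title and title_phonetic are hypothetical names. -->

<!-- A separate field type that holds phonetic codes only. -->
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject="false" replaces each token with its phonetic code;
         the default, inject="true", keeps the original token as well. -->
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldType>

<!-- The exact field keeps the unmodified "text" type (assumed stock). -->
<field name="title" type="text" indexed="true" stored="true"/>
<!-- The phonetic twin is filled by copyField, so clients send only "title". -->
<field name="title_phonetic" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="title" dest="title_phonetic"/>
```

The cost is one extra indexed field per searchable field, which is the
overhead Jamie wanted to avoid, but exact matching on the original field
stays intact.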

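On the dismax side, a request handler along the following lines in
solrconfig.xml could express "prefer exact matches, and less phonetic
matches".  Again this is a hedged sketch: the handler name /phonetic, the
field names title and title_phonetic, and the boost values are all
illustrative assumptions, not settings taken from this thread.

```xml
<!-- Sketch only: handler name, field names and boosts are hypothetical. -->
<requestHandler name="/phonetic" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the DisMax query parser for the main query. -->
    <str name="defType">dismax</str>
    <!-- Query both fields; the exact field carries a much higher boost,
         so exact hits rank above phonetic-only hits. -->
    <str name="qf">title^10 title_phonetic^1</str>
  </lst>
</requestHandler>
```

The boosts only influence relevance scores, which is why Paul's advice
applies when results are sorted by score: under any other sort order the
phonetic-only hits are mixed in with the exact ones.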