lucene-solr-user mailing list archives

From Scottie <scot...@live.com>
Subject Re: Tokenising on Each Letter
Date Mon, 23 Aug 2010 14:00:09 GMT

Probably a good idea to post the relevant information! I guess I thought it
would be a really obvious answer, but it seems it's a bit more complex ;)

<field name="productsModel" type="textTight" indexed="true" stored="true" omitNorms="true"/>

    <!-- Less flexible matching, but less false matches.  Probably not ideal for product names,
         but may be good for SKUs.  Can insert dashes in the wrong place and still match. -->
    <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <!-- this filter can remove any duplicate tokens that appear at the same position -
             sometimes possible with WordDelimiterFilter in conjunction with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

It seems you may be correct about the catenateAll option, but I'm not sure
that adding a wildcard to the end of every search would be a great idea.
This is meant to be applied to a general search box, but still retain
flexibility for model numbers. Right now we are using MySQL %term% wildcards,
so it matches pretty much anything in the model number, whether you cut off
the start or the end, and I wanted to retain that.
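
As an illustration of the matching behaviour described above, here is a minimal Python sketch (not Solr code). It assumes the MySQL comparison is case-insensitive, and the function names `like_contains` and `char_ngrams` are hypothetical, chosen only for this example. The second function shows why indexing character n-grams can give the same "match anywhere" effect: a query is a substring of the model number exactly when it appears among the n-grams of its own length.

```python
from __future__ import annotations


def like_contains(model: str, query: str) -> bool:
    """Emulate `model LIKE '%query%'` as a case-insensitive substring test."""
    return query.lower() in model.lower()


def char_ngrams(text: str, min_size: int, max_size: int) -> set[str]:
    """Return every lowercase character n-gram with min_size <= n <= max_size."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(text) - n + 1)
    }


model = "EQW-500DBE-1AVER"
grams = char_ngrams(model, 3, 6)  # gram sizes 3..6 are an arbitrary choice

# For queries of 3 to 6 characters, both columns agree.
for query in ("EQW", "500dbe", "1AVER", "XYZ"):
    print(query, like_contains(model, query), query.lower() in grams)
```

The trade-off is index size: the number of grams grows with both the length of the value and the range of gram sizes, and queries shorter or longer than the configured range will not match as a single gram.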

Could you elaborate on n-grams for me, based on my schema?

The main reason I picked textTight was for model numbers like
EQW-500DBE-1AVER; I thought it would produce better results for those?

Thanks a lot for the detailed reply.

Scott
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1291984.html
Sent from the Solr - User mailing list archive at Nabble.com.
