lucene-solr-user mailing list archives

From Nikolas Tautenhahn <nik_s...@livinglogic.de>
Subject Re: Tokenising on Each Letter
Date Mon, 23 Aug 2010 14:13:15 GMT
Hi Scottie,

> Could you elaborate about N gram for me, based on my schema?

Just a quick reply:

>     <fieldType name="textNGram" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> -->
>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="2" maxGramSize="30"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>

This will produce edge n-grams (prefixes taken from the front of each
token) from 2 up to 30 characters long. For more info, see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
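
To make that concrete, here is a rough Python sketch of what a front-side
edge n-gram filter does to a single token once it reaches that filter. It
is only an illustration, not Solr's implementation; the function name
edge_ngrams and the sample token "abc123" are made up for this example.

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    """Return the prefixes of `token` that are min_gram..max_gram characters long."""
    # A token shorter than max_gram can only yield prefixes up to its own length.
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Hypothetical token, as it might look after lowercasing.
print(edge_ngrams("abc123"))
```

With the sizes from the schema above (2 and 30), a six-character token
yields five index terms: "ab", "abc", "abc1", "abc12" and "abc123".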

Be sure to adjust those sizes (minGramSize/maxGramSize) so that
maxGramSize is big enough to keep the whole original serial number/model
number and minGramSize is not so small that you fill your index with
useless information.
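
One way to sanity-check candidate sizes is to compute, for a few sample
values, how many prefix terms a token would add to the index and whether
the complete token still fits within maxGramSize. This is a Python sketch
under the assumption that the filter emits prefixes of minGramSize to
maxGramSize characters per token; check_sizes and the sample serial number
are hypothetical, and the earlier filters in the chain may split a real
value into several tokens first.

```python
def check_sizes(token, min_gram, max_gram):
    """Return (number of prefix grams, whether the full token is one of them)."""
    longest = min(len(token), max_gram)
    gram_count = max(0, longest - min_gram + 1)
    # The whole token is only indexed as a gram if its length is within the limits.
    full_token_kept = min_gram <= len(token) <= max_gram
    return gram_count, full_token_kept


serial = "SN-2010-0823-ABCDEF"  # hypothetical serial number, 19 characters
for min_gram, max_gram in [(2, 30), (2, 10), (4, 30)]:
    print(min_gram, max_gram, check_sizes(serial, min_gram, max_gram))
```

If the second value comes back False for a typical serial number,
maxGramSize is too small for a query on the full value to match.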

Best regards,
Nikolas Tautenhahn


