lucene-solr-user mailing list archives

From Scottie <scot...@live.com>
Subject Re: Tokenising on Each Letter
Date Mon, 23 Aug 2010 16:16:48 GMT

Nikolas, thanks a lot for that. I've just given it a quick test and it definitely seems to
work for the examples I gave.

Thanks again,

Scott


From: Nikolas Tautenhahn [via Lucene] 
Sent: Monday, August 23, 2010 3:14 PM
To: Scottie 
Subject: Re: Tokenising on Each Letter


Hi Scottie, 

> Could you elaborate about N gram for me, based on my schema? 

just a quick reply: 


>     <fieldType name="textNGram" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> -->
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="2" maxGramSize="30"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>

This will produce front-edge n-grams (prefixes) from 2 up to 30 characters for
each token at index time. For more information, see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

Be sure to adjust those sizes (minGramSize/maxGramSize) so that 
maxGramSize is big enough to keep the whole original serial number/model 
number and minGramSize is not so small that you fill your index with 
useless information. 
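
To make the effect of minGramSize/maxGramSize concrete, here is a rough
Python sketch of what a front edge n-gram filter emits for one token. It is
only an illustration and not Solr's actual code: the function name
edge_ngrams and the sample model number "ab123" are made up, and the
handling of tokens longer than maxGramSize is assumed to follow the classic
filter behaviour (grams are cut off at maxGramSize, so the full token is not
indexed).

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    """Return the front-edge n-grams (prefixes) of token, shortest first.

    Hypothetical helper mimicking side="front" edge n-gramming; the
    defaults mirror minGramSize="2" and maxGramSize="30" from the schema.
    """
    # No gram can be longer than the token itself or than max_gram.
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# "ab123" is an invented, already lowercased model number.
print(edge_ngrams("ab123"))              # ['ab', 'ab1', 'ab12', 'ab123']

# If max_gram is smaller than the token, the full token is never emitted.
print(edge_ngrams("ab123", max_gram=3))  # ['ab', 'ab1']

# Tokens shorter than min_gram produce no grams at all.
print(edge_ngrams("a"))                  # []
```

The second call shows why maxGramSize should be at least as long as the
longest serial/model number you want to match in full, and the first call
shows how every extra prefix adds terms to the index.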

Best regards, 
Nikolas Tautenhahn 





-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1294586.html
Sent from the Solr - User mailing list archive at Nabble.com.
