lucene-solr-user mailing list archives

From Chris Hostetter <>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Tue, 13 Oct 2009 04:27:29 GMT

: I had to be brief as my facets are in the order of 100K over 800K documents
: and also if I give the complete schema.xml I was afraid nobody would read my
: long message :-) ..Hence I showed only relevant pieces of the result showing
: different fields having same problem

relevant is good, but you have to provide a consistent picture from start 
to finish ... you don't need to show 1,000 lines of facet field output, 
but you at least need to show the field names.

:     <fieldType name="keywordText" class="solr.TextField"
: sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
:       <analyzer type="index">
:         <tokenizer class="solr.KeywordTokenizerFactory"/>
:         <filter class="solr.TrimFilterFactory" />
:         <filter class="solr.StopFilterFactory" ignoreCase="true"
: words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
:         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
: ignoreCase="true" expand="false" />
:         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
:       </analyzer>

...have you used analysis.jsp to see what terms that analyzer produces 
based on the strings you are indexing for your documents?  because 
combined with synonyms like this...

: New York, N.Y., NY => New York

...it doesn't surprise me that you're getting "New" as an indexed term.  
By default SynonymFilter uses whitespace to delimit tokens in multi-token 
synonyms, so for some input like "NY" you should see it produce the tokens 
"New" and "York".

you can use the tokenizerFactory attribute on SynonymFilterFactory to 
specify a TokenizerFactory class to use when parsing synonyms.txt
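
as a rough sketch of what that could look like, here is the index analyzer 
from the quoted schema with only the tokenizerFactory attribute added.  
picking solr.KeywordTokenizerFactory for it is an assumption on my part 
(it should keep each synonym entry such as "New York" as a single token 
rather than splitting it on whitespace) ... verify the result in 
analysis.jsp before relying on it:

```xml
<!-- sketch only: index analyzer from the quoted schema, one attribute added -->
<analyzer type="index">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.TrimFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt,entity-stopwords.txt"
          enablePositionIncrements="true"/>
  <!-- tokenizerFactory sets how entries in synonyms.txt are tokenized;
       KeywordTokenizerFactory is an assumed choice, meant to keep
       multi-word entries like "New York" as one token -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="false"
          tokenizerFactory="solr.KeywordTokenizerFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

documents indexed before the change keep their old terms, so you would 
need to reindex to see different facet values.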

