lucene-solr-user mailing list archives

From Ravi Kiran <ravi.bhas...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Fri, 30 Oct 2009 03:02:13 GMT
Thank you very much. I shall try out the tokenizerFactory attribute on
SynonymFilterFactory.
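
A minimal sketch of that change, assuming the keywordText index analyzer
quoted below and assuming solr.KeywordTokenizerFactory is the right value
for tokenizerFactory (it matches the field's own tokenizer; neither choice
is confirmed in this thread). The only difference from the quoted filter is
the added tokenizerFactory attribute, which should make each synonyms.txt
entry such as "New York" parse as one token instead of being split on
whitespace:

```xml
<!-- Sketch only: same SynonymFilterFactory line as in the quoted schema,
     plus tokenizerFactory. The KeywordTokenizerFactory value is an assumption. -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="false"
        tokenizerFactory="solr.KeywordTokenizerFactory"/>
```

After re-indexing, analysis.jsp should show whether "NY" now maps to the
single term "New York" rather than to "New" and "York".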

On Tue, Oct 13, 2009 at 12:27 AM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:

>
> : I had to be brief as my facets are in the order of 100K over 800K documents
> : and also if I give the complete schema.xml I was afraid nobody would read my
> : long message :-) ..Hence I showed only relevant pieces of the result showing
> : different fields having same problem
>
> relevant is good, but you have to provide a consistent picture from start
> to finish ... you don't need to show 1,000 lines of facet field output,
> but you at least need to show the field names.
>
> :     <fieldType name="keywordText" class="solr.TextField"
> :                sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> :       <analyzer type="index">
> :         <tokenizer class="solr.KeywordTokenizerFactory"/>
> :         <filter class="solr.TrimFilterFactory" />
> :         <filter class="solr.StopFilterFactory" ignoreCase="true"
> :                 words="stopwords.txt,entity-stopwords.txt"
> :                 enablePositionIncrements="true"/>
> :
> :         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> :                 ignoreCase="true" expand="false" />
> :         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> :       </analyzer>
>
> ...have you used analysis.jsp to see what terms that analyzer produces
> based on the strings you are indexing for your documents?  because
> combined with synonyms like this...
>
> : New York, N.Y., NY => New York
>
> ...it doesn't surprise me that you're getting "New" as an indexed term.
> By default SynonymFilter uses whitespace to delimit tokens in multi-token
> synonyms, so for some input like "NY" you should see it produce two
> separate tokens, "New" and "York"
>
> you can use the tokenizerFactory attribute on SynonymFilterFactory to
> specify a TokenizerFactory class to use when parsing synonyms.txt
>
>
>
> -Hoss
>
>
