My company recently started using Solr for site search and autocomplete.
It's working great, but we're running into a problem with synonyms. We are
generating a synonyms.txt file from a database table and using that
synonyms.txt file at index time on a text type field. Here's an excerpt
from the synonyms file:
reebox => Reebok
shinguards => Shin Guards
shirt => T-Shirt,Shirt
shmak => Shmack
shocks => shox
skateboard => Skate
skateboarding => Skate
skater => Skate
skates => Skate
skating => Skate
skirt => Dresses
When we do a search for reebox, we want the term to be mapped to "Reebok"
through explicit mapping, but for some reason this isn't happening. We do
have multi-word synonyms, and from what I've read on the mailing list, those
only work at index time, so we are only using the synonym filter factory at
index time:
<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Here are more relevant schema.xml configs:
<field name="mashup" type="search" indexed="true" stored="false"
multiValued="true"/>
<copyField source="keywords" dest="mashup"/>
<copyField source="category" dest="mashup"/>
<copyField source="name" dest="mashup"/>
<copyField source="brand" dest="mashup"/>
<copyField source="description_overview" dest="mashup"/>
<copyField source="sku" dest="mashup"/>
<!-- other copy fields... -->
The output of the query analyzer shows the following:
Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
    term position      1
    term text          reebox
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
    term position      1
    term text          reebox
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0,
catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
    term position      1
    term text          reebox
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.LowerCaseFilterFactory {}
    term position      1
    term text          reebox
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
    term position      1
    term text          reebox
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position      1
    term text          reebox
    term type          word
    source start,end   0,6
    payload
So "reebox" is never converted to "Reebok" at query time. I thought that if I
had index-time synonyms with expansion configured, I wouldn't need query-time
synonyms. Maybe my dynamically generated synonyms file isn't formatted
correctly for my desired result?
If I use the same synonyms.txt file and use the index analyzer, reebox is
mapped to Reebok and then indexed correctly:
Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
    term position      1
    term text          reebox
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
    term position      1
    term text          Reebok
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
    term position      1
    term text          Reebok
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0,
catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
    term position      1
    term text          Reebok
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.LowerCaseFilterFactory {}
    term position      1
    term text          reebok
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
    term position      1
    term text          reebok
    term type          word
    source start,end   0,6
    payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position      1
    term text          reebok
    term type          word
    source start,end   0,6
    payload
Should I use equivalent mapping instead of explicit mapping if I'm only
using index-time synonyms? Or should I turn query-time synonyms on for my
search field?
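For the second option, I assume the query analyzer would look roughly like the
sketch below: the same query filter chain as above, with the synonym filter
added right after the tokenizer. Reusing synonyms.txt at query time and keeping
expand="true" are my assumptions, not something I've tested, and multi-word
entries such as "shinguards => Shin Guards" may not behave as intended at query
time, going by the mailing list threads I mentioned.

```xml
<!-- Sketch only: query analyzer for the "search" field type with synonyms enabled.
     Assumption: the same synonyms.txt is reused at query time. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Added filter: should map a query term like "reebox" to "Reebok"
       before the remaining filters lowercase and stem it. -->
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <!-- Remaining filters are unchanged from the existing query analyzer. -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0" generateNumberParts="0" catenateWords="1"
          catenateNumbers="1" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory"
          language="English" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

With this in place I would expect the query analyzer output for "reebox" to end
in "reebok", matching what the index analyzer produces.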
Thanks,
Michael