lucene-solr-user mailing list archives

From mtdowling <mtdowl...@gmail.com>
Subject Solr synonyms format query time vs index time
Date Tue, 17 Aug 2010 18:23:38 GMT

My company recently started using Solr for site search and autocomplete. 
It's working great, but we're running into a problem with synonyms.  We are
generating a synonyms.txt file from a database table and using that
synonyms.txt file at index time on a text type field.  Here's an excerpt
from the synonyms file:

reebox => Reebok
shinguards => Shin Guards
shirt => T-Shirt,Shirt
shmak => Shmack
shocks => shox
skateboard => Skate
skateboarding => Skate
skater => Skate
skates => Skate
skating => Skate
skirt => Dresses

When we do a search for reebox, we want the term to be mapped to "Reebok"
through explicit mapping, but for some reason this isn't happening.  We do
have multi-word synonyms, and from what I've read on the mailing list, those
only work at index time, so we are only using the synonym filter factory at
index time:

<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"
                language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"
                language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

Here are more relevant schema.xml configs:

<field name="mashup" type="search" indexed="true" stored="false" multiValued="true"/>
<copyField source="keywords" dest="mashup"/>
<copyField source="category" dest="mashup"/>
<copyField source="name" dest="mashup"/>
<copyField source="brand" dest="mashup"/>
<copyField source="description_overview" dest="mashup"/>
<copyField source="sku" dest="mashup"/>
<!-- other copy fields... -->

The output of the query analyzer shows the following:

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0,
catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.SnowballPorterFilterFactory
{protected=protwords.txt, language=English}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload

So at query time "reebox" is never converted to "Reebok".  I thought that
with index-time synonyms and expansion configured, I wouldn't need
query-time synonyms.  Maybe my dynamically generated synonyms file isn't
formatted correctly for my desired result?

If I use the same synonyms.txt file and use the index analyzer, reebox is
mapped to Reebok and then indexed correctly:

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=true, ignoreCase=true}
term position 	1
term text 	Reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true}
term position 	1
term text 	Reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0,
catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
term position 	1
term text 	Reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position 	1
term text 	reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.SnowballPorterFilterFactory
{protected=protwords.txt, language=English}
term position 	1
term text 	reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position 	1
term text 	reebok
term type 	word
source start,end 	0,6
payload 	


Should I use equivalent mapping instead of explicit mapping if I'm only
using index-time synonyms?  Or should I turn on query-time synonyms for my
search field?
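
For reference, here is a rough sketch of what turning on query-time synonyms
would look like: the query analyzer from the field type above, with the same
SynonymFilterFactory line added right after the tokenizer to mirror the index
analyzer. This is an untested sketch, not a confirmed fix. Placing the filter
first and reusing the same attributes are assumptions copied from the index
analyzer, and multi-word entries such as "shinguards => Shin Guards" may not
behave as intended at query time.

```xml
<analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Assumption: same synonyms file and attributes as the index analyzer,
         so a query for "reebox" would be rewritten to "Reebok" before the
         remaining filters lowercase and stem it. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```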

Thanks,
Michael
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Solr-synonyms-format-query-time-vs-index-time-tp1192743p1192743.html
Sent from the Solr - User mailing list archive at Nabble.com.
