lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: How should I configure Solr to support multi-word synonyms?
Date Mon, 04 Mar 2013 22:20:55 GMT
If you want multi-term synonyms at query time, you will need to enclose the 
sequence of terms in quotes. Otherwise, the query analyzer will see only one 
term at a time and not recognize any multi-term synonyms.

Note that the synonym filter will need to see "phrase one" as two separate 
terms, so using the keyword tokenizer will not work since it will treat 
"phrase one" as a single term.
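
To illustrate the two points above, here is a rough Python sketch. It is a toy
model, not Lucene's SynonymFilter: the SYNONYM_ENTRIES list and all function
names are made up for illustration, and the "one term at a time" handling of
an unquoted query is modelled simply by splitting on whitespace.

```python
# Toy model only; none of these names exist in Solr or Lucene.
# Each synonym entry is stored as a tuple of lowercase tokens.
SYNONYM_ENTRIES = [("wordone",), ("phrase", "one")]


def match_synonyms(tokens):
    """Return every synonym entry that occurs as a contiguous run in tokens."""
    hits = []
    for start in range(len(tokens)):
        for entry in SYNONYM_ENTRIES:
            if tuple(tokens[start:start + len(entry)]) == entry:
                hits.append(entry)
    return hits


def analyze_unquoted(query):
    # Assumption: each whitespace-separated term is analyzed on its own,
    # so the filter never sees two adjacent terms together.
    hits = []
    for term in query.lower().split():
        hits.extend(match_synonyms([term]))
    return hits


def analyze_quoted(query):
    # A quoted phrase is analyzed as one sequence, so its terms stay adjacent.
    return match_synonyms(query.lower().split())


print(analyze_unquoted("phrase one"))   # no multi-term entry can match
print(analyze_quoted("phrase one"))     # the two-term entry matches
print(match_synonyms(["phrase one"]))   # a single keyword-style token does not
```

In this sketch the multi-term entry matches only when both terms reach the
filter together and as separate tokens.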

-- Jack Krupansky

-----Original Message----- 
From: David Sharpe
Sent: Monday, March 04, 2013 1:40 PM
To: solr-user@lucene.apache.org
Subject: How should I configure Solr to support multi-word synonyms?

Hello Solr mailing list,

I have read many posts and run many tests, but still I cannot get
multi-word synonyms behaving the way I think they should. I would
appreciate your advice.

Here is an example of the behaviour I am trying to achieve:

# Given synonyms.txt
wordOne, phrase one


   1. At index time, a document containing "wordOne" should expand to
   "wordOne | phrase one". A query for "wordOne" or "phrase one" should find
   the document, but a query for just "phrase" or "one" should not find the
   document.

   2. Conversely, a document containing "phrase one" should expand to
   "phrase one | wordOne". A query for "wordOne" or "phrase one" should find
   the document. (Depending on field tokenization, I would also expect
   "phrase" and "one" to find the document.)

To attempt to achieve this behaviour, I have downloaded Solr 4.1.0 and made
the following changes to
"solr-4.1.0\example\solr\collection1\conf\schema.xml":

https://gist.github.com/sharpedavid/5072150


(Note that I set the SynonymFilterFactory attribute
tokenizerFactory="solr.KeywordTokenizerFactory". This is to prevent
"wordOne" from being expanded to "wordOne | phrase | one".)

Achieving the first behaviour (i.e. number one in the above list) seems
difficult. A query for "wordOne" returns the document, but a query for
"phrase one" returns nothing. I realized that the query tokenizer tokenized
my query for "phrase one", so I changed the query tokenizer to
KeywordTokenizer, which achieves the desired behaviour, but now queries are
not tokenized at all, which breaks other desirable behaviour.

The second behaviour (i.e. number two in the above list) has similar
problems, but no solution that I can see. If the index tokenizer is
StandardTokenizer, "phrase one" is tokenized to "phrase | one", so the
equivalent synonym is not matched. If I change the index tokenizer to
KeywordTokenizer, it does match; however, KeywordTokenizer will treat the
entire field as a single token, so a document containing "something
phrase one something" will not match the equivalent synonym, and also a
query for "phrase" or "one" will not find the document.
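
The mismatch described above can be sketched in Python. This is only a toy
model under stated assumptions: keyword_tokenize and whitespace_tokenize are
simplified stand-ins for KeywordTokenizer and StandardTokenizer, and
parse_rule and rule_matches are hypothetical helpers, not Solr APIs.

```python
def keyword_tokenize(text):
    # Stand-in for KeywordTokenizer: the whole input becomes one token.
    return [text]


def whitespace_tokenize(text):
    # Simplified stand-in for StandardTokenizer: split on whitespace only.
    return text.split()


def parse_rule(line, tokenize):
    # Tokenize each comma-separated synonym entry with the given tokenizer.
    return [tuple(tokenize(entry.strip())) for entry in line.split(",")]


def rule_matches(field_tokens, rule_entries):
    # True if any entry occurs as a contiguous run of field tokens.
    for entry in rule_entries:
        n = len(entry)
        for i in range(len(field_tokens) - n + 1):
            if tuple(field_tokens[i:i + n]) == entry:
                return True
    return False


# Synonym rule parsed keyword-style: "phrase one" is a single token.
RULE = parse_rule("wordOne, phrase one", keyword_tokenize)

# Field split into "phrase" and "one": the single-token entry cannot match.
print(rule_matches(whitespace_tokenize("phrase one"), RULE))
# Field kept whole: matches only when the field is exactly the phrase.
print(rule_matches(keyword_tokenize("phrase one"), RULE))
print(rule_matches(keyword_tokenize("something phrase one something"), RULE))
```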

Thank you for your time.

Sincerely,
David Sharpe 

