lucene-solr-user mailing list archives

From David Sharpe
Subject How should I configure Solr to support multi-word synonyms?
Date Mon, 04 Mar 2013 18:40:20 GMT
Hello Solr mailing list,

I have read many posts and run many tests, but still I cannot get
multi-word synonyms behaving the way I think they should. I would
appreciate your advice.

Here is an example of the behaviour I am trying to achieve:

# Given synonyms.txt
wordOne, phrase one

   1. At index time, a document containing "wordOne" should expand to
   "wordOne | phrase one". A query for "wordOne" or "phrase one" should find
   the document, but a query for just "phrase" or "one" should not find the
   document.

   2. Conversely, a document containing "phrase one" should expand to
   "phrase one | wordOne". A query for "wordOne" or "phrase one" should find
   the document. (Depending on field tokenization, I would also expect
   "phrase" and "one" to find the document.)

To attempt to achieve this behaviour, I have downloaded Solr 4.1.0 and made
the following changes to

(Note that I set SynonymFilterFactory with
tokenizerFactory="solr.KeywordTokenizerFactory". This is to prevent
"wordOne" from being expanded to "wordOne | phrase | one".)
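
As a rough illustration of the setup described above, a field type might
look like the following sketch. The field type name "text_synonyms" and the
use of StandardTokenizerFactory in both analyzers are assumptions made for
the example, not the exact configuration from this message; only the
tokenizerFactory setting on SynonymFilterFactory is taken from the note
above.

```xml
<!-- Hypothetical field type; the name and tokenizer choices are assumptions. -->
<fieldType name="text_synonyms" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- tokenizerFactory controls how the entries in synonyms.txt are parsed.
         KeywordTokenizerFactory keeps "phrase one" as a single token instead
         of splitting it into "phrase" and "one". -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt"
            expand="true"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With tokenizerFactory set this way, each side of a synonyms.txt rule is read
as one token, so "phrase one" is added to the index as a single token that
contains a space.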

Achieving the first behaviour (i.e. number one in the above list) seems
difficult. A query for "wordOne" returns the document, but a query for
"phrase one" returns nothing. I realized that the query tokenizer tokenized
my query for "phrase one", so I changed the query tokenizer to
KeywordTokenizer, which achieves the desired behaviour, but now queries are
not tokenized at all, which breaks other desirable behaviour.
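
The query-side workaround described above amounts to something like the
following fragment. This is only a sketch of the change, assuming the index
analyzer is left as it was.

```xml
<!-- Query analyzer only; the index analyzer is assumed to be unchanged. -->
<analyzer type="query">
  <!-- KeywordTokenizer emits the whole query string as one token, so
       "phrase one" can match the single synonym token in the index,
       but multi-word queries are no longer split into separate terms. -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
```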

The second behaviour (i.e. number two in the above list) has similar
problems, but no solution that I can see. If the index tokenizer is
StandardTokenizer, "phrase one" is tokenized to "phrase | one", so the
equivalent synonym is not matched. If I change the index tokenizer to
KeywordTokenizer, it does match; however, KeywordTokenizer will treat the
entire field as a single token, so a document containing "something
phrase one something" will not match the equivalent synonym, and also a
query for "phrase" or "one" will not find the document.

Thank you for your time.

David Sharpe
