lucene-solr-user mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Clarifications to Synonym Filter Wiki entry? (2 of 2)
Date Mon, 24 Aug 2009 17:47:57 GMT
There are a couple of things about the Solr Thesaurus doc that I'd like to
confirm / understand.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter

There's a section about multi-word matching, using seabiscuit as an example.
I've also seen references to this discussion in posts talking about dismax
and the synonym filter (quoted below).  Where I think it could use some
additional clarification is in this sentence:
"The recommended approach ... is to expand the synonym when indexing."

The section below describes why not doing it this way won't work, but it
doesn't explain how using index-time expansion fixes it.  In particular,
even if I do index-time expansion, isn't a multi-word input synonym still
going to be messed with by the Lucene parser?  From the Wiki: "The Lucene
QueryParser tokenizes on white space before giving any text to the
Analyzer... ".  Understood, but how does index-time expansion address that,
either directly or indirectly?

> Keep in mind that while the SynonymFilter will happily work with synonyms
> containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit")
> The recommended approach for dealing with synonyms like this, is to expand
> the synonym when indexing. This is because there are two potential issues
> that can arise at query time:
>
> 1: The Lucene QueryParser tokenizes on white space before giving any text
> to the Analyzer, so if a person searches for the words sea biscit the
> analyzer will be given the words "sea" and "biscit" separately, and will
> not know that they match a synonym.
>
> 2: Phrase searching (ie: "sea biscit") will cause the QueryParser to pass
> the entire string to the analyzer, but if the SynonymFilter is configured
> to expand the synonyms, then when the QueryParser gets the resulting list
> of tokens back from the Analyzer, it will construct a MultiPhraseQuery
> that will not have the desired effect. This is because of the limited
> mechanism available for the Analyzer to indicate that two terms occupy
> the same position: there is no way to indicate that a "phrase" occupies
> the same position as a term. For our example the resulting
> MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)"
> which would not match the simple case of "seabiscuit" occurring in a
> document
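
The two query-time issues quoted above, and one plausible reading of why
index-time expansion sidesteps them, can be sketched with a toy model in
Python. This is only an illustration under assumptions, not Lucene or Solr
code: the names expand, plain, term_matches and phrase_matches are
hypothetical, an index is modelled as a list of positions (each a set of
terms), and matching is simplified to set intersection.

```python
# Toy model only: hypothetical helpers, NOT the Lucene/Solr implementation.
SYNONYMS = [("sea", "biscuit"), ("sea", "biscit"), ("seabiscuit",)]
WIDTH = max(len(s) for s in SYNONYMS)  # longest synonym, in tokens


def expand(tokens):
    """Toy synonym filter: where a synonym matches, emit every variant.

    Returns a list of positions; each position is a set of terms that
    share that position (a term can be stacked, a whole phrase cannot).
    """
    positions, i = [], 0
    while i < len(tokens):
        hit = next((s for s in SYNONYMS if tuple(tokens[i:i + len(s)]) == s), None)
        if hit is None:
            positions.append({tokens[i]})
            i += 1
        else:
            slots = [set() for _ in range(WIDTH)]
            for s in SYNONYMS:
                for k, term in enumerate(s):
                    slots[k].add(term)
            positions.extend(slots)
            i += len(hit)
    return positions


def plain(tokens):
    """No synonym handling: one term per position."""
    return [{t} for t in tokens]


def term_matches(query, doc_positions):
    """Unquoted query: split on whitespace FIRST, then analyze each chunk
    on its own, so the synonym filter never sees two words together."""
    doc_terms = set().union(*doc_positions)
    for chunk in query.split():
        chunk_terms = set().union(*expand([chunk]))
        if not chunk_terms & doc_terms:
            return False
    return True


def phrase_matches(query_positions, doc_positions):
    """Phrase query: every query position must hit consecutive doc positions."""
    n = len(query_positions)
    return any(
        all(query_positions[k] & doc_positions[start + k] for k in range(n))
        for start in range(len(doc_positions) - n + 1)
    )


doc = "seabiscuit wins".split()
raw_index = plain(doc)        # synonyms NOT expanded at index time
expanded_index = expand(doc)  # synonyms expanded at index time

# Issue 1: "sea" and "biscit" are analyzed separately, no synonym is seen.
print(term_matches("sea biscit", raw_index))                            # False
# Issue 2: the expanded phrase is "(sea | seabiscuit) (biscuit | biscit)".
print(phrase_matches(expand(["sea", "biscit"]), raw_index))             # False
# Index-time expansion: every variant's terms are already in the index.
print(term_matches("sea biscit", expanded_index))                       # True
print(phrase_matches(plain(["sea", "biscit"]), expanded_index))         # True
```

In this model, index-time expansion does not change what the query parser
does. The query is still split on white space, but because each word of
every variant was written into the index, the separately analyzed words
"sea" and "biscit" still find the document that only contained
"seabiscuit".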

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
