lucene-solr-user mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Re: Clarifications to Synonym Filter Wiki entry? (2 of 2)
Date Mon, 24 Aug 2009 20:24:10 GMT
Here's the best thread I've found so far about multi-word matching and
synonyms:
http://www.nabble.com/solr-synonyms-behaviour-ts15051211.html#a18476205

And an interesting workaround:
http://www.nabble.com/solr-synonyms-behaviour-ts15051211.html#a18693735

Earlier on, the thread repeats the claim that if you use index-side
expansion, you won't have a problem.  But it doesn't explain how or why that
fixes it, given that the Lucene query parser still breaks on white space.

Later there's a clue: it seems that even single words of a multi-word
thesaurus entry are matched.  So I guess Lucene doesn't need to see both
words of a multi-word query together; it just picks up either word.  That
works around the multi-word parsing problem, but adds the undesirable side
effect of false positive matches?

So shouldn't the repeated claim that index-side expansion fixes multi-word
matching always carry the caveat "... but it can cause false positive
matches when only one of the words is present"?
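
To make that guess concrete, here is a toy sketch in plain Python.  It is
not Lucene or Solr code: the function names are made up, positions are
ignored, and it assumes unquoted query words are ORed together.  It only
illustrates why expanding at index time could make "sea biscit" find a
document containing "seabiscuit", and why a bare "sea" would then match
too.

```python
# Toy model only: not Lucene/Solr code. All names here are hypothetical.
SYNONYMS = [["sea", "biscuit"], ["sea", "biscit"], ["seabiscuit"]]


def index_tokens(text, expand=True):
    """Return the set of terms stored in the index for one document."""
    words = text.lower().split()
    terms = set(words)
    if not expand:
        return terms
    for variant in SYNONYMS:
        n = len(variant)
        # Does this variant occur in the document as consecutive words?
        if any(words[i:i + n] == variant for i in range(len(words) - n + 1)):
            # Index-side expansion: add every word of every variant.
            for other in SYNONYMS:
                terms.update(other)
    return terms


def matches(query, doc_terms):
    """Split the query on white space, then OR the words (an assumption)."""
    return any(word in doc_terms for word in query.lower().split())


expanded = index_tokens("Seabiscuit won the race")
plain = index_tokens("Seabiscuit won the race", expand=False)

print(matches("sea biscit", plain))     # no expansion: nothing to match on
print(matches("sea biscit", expanded))  # expansion: the split words now hit
print(matches("sea", expanded))         # ...but so does "sea" on its own
```

If that model is roughly right, the last line is the false positive: the
document never mentions the sea, yet a search for "sea" finds it.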

Am I understanding this correctly?  If true, it's likely to be acceptable in
many applications; it's just a question of understanding the trade-offs.

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Mon, Aug 24, 2009 at 10:47 AM, Mark Bennett <mbennett@ideaeng.com> wrote:

> There are a couple of things about the Solr Thesaurus doc that I'd like to
> confirm / understand.
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter
>
> There's a section about multi-word matching (quoted below), using
> seabiscuit as an example.  I've also seen references to this discussion in
> posts talking about dismax and the synonym filter.  Where I think it
> could use some additional clarification is in this sentence:
> "The recommended approach ... is to expand the synonym when indexing."
>
> The section below describes why not doing it this way won't work, but it
> doesn't explain how using index-time expansion fixes it.  In particular,
> even if I do index-time expansion, isn't a multi-word input synonym still
> going to be messed with by the Lucene parser?  From the Wiki: "The Lucene
> QueryParser tokenizes on white space before giving any text to the
> Analyzer... ".  Understood, but how does index-time expansion address that,
> either directly or indirectly?
>
> > Keep in mind that while the SynonymFilter will happily work with synonyms
> > containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit"),
> > the recommended approach for dealing with synonyms like this is to expand
> > the synonym when indexing. This is because there are two potential issues
> > that can arise at query time:
> >
> > 1: The Lucene QueryParser tokenizes on white space before giving any text
> > to the Analyzer, so if a person searches for the words sea biscit, the
> > analyzer will be given the words "sea" and "biscit" separately, and will
> > not know that they match a synonym.
> >
> > 2: Phrase searching (ie: "sea biscit") will cause the QueryParser to pass
> > the entire string to the analyzer, but if the SynonymFilter is configured
> > to expand the synonyms, then when the QueryParser gets the resulting list
> > of tokens back from the Analyzer, it will construct a MultiPhraseQuery
> > that will not have the desired effect. This is because of the limited
> > mechanism available for the Analyzer to indicate that two terms occupy
> > the same position: there is no way to indicate that a "phrase" occupies
> > the same position as a term. For our example the resulting
> > MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)",
> > which would not match the simple case of "seabiscuit" occurring in a
> > document.
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
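
The two query-time issues in the quoted wiki text can be pictured with
another toy sketch in plain Python.  Again, this is not Lucene code and
every name in it is hypothetical; `multi_phrase_match` only imitates the
"one set of alternative terms per position" behaviour that the wiki
attributes to MultiPhraseQuery.

```python
# Toy model only: not Lucene code. All names here are hypothetical.
SYNONYMS = {("sea", "biscuit"), ("sea", "biscit"), ("seabiscuit",)}


def analyze_with_synonyms(chunk):
    """Expand a chunk of text only if, as a whole, it is a known variant."""
    words = tuple(chunk.lower().split())
    if words in SYNONYMS:
        return sorted(" ".join(variant) for variant in SYNONYMS)
    return [" ".join(words)]


def parse_unquoted(query):
    """Issue 1: split on white space first, then analyze each word alone."""
    return [analyze_with_synonyms(word) for word in query.split()]


def multi_phrase_match(positions, doc_words):
    """Issue 2: each position holds a set of single terms; consecutive
    document words must satisfy consecutive positions."""
    n = len(positions)
    return any(
        all(doc_words[i + k] in positions[k] for k in range(n))
        for i in range(len(doc_words) - n + 1)
    )


# Issue 1: the analyzer never sees "sea biscit" as a unit.
print(parse_unquoted("sea biscit"))
# Given the whole phrase, the same analyzer does recognise it.
print(analyze_with_synonyms("sea biscit"))

# Issue 2: the query from the wiki example, one set per position.
mpq = [{"sea", "seabiscuit"}, {"biscuit", "biscit"}]
print(multi_phrase_match(mpq, ["sea", "biscuit"]))
print(multi_phrase_match(mpq, ["seabiscuit"]))
```

In this model the two-position query can never line up with a document
that holds the single token "seabiscuit", which is the failure the wiki
describes.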
