lucene-solr-user mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Clarifications to Synonym Filter Wiki entry? (1 of 2)
Date Mon, 24 Aug 2009 17:47:45 GMT
There are a couple of things about the Solr Thesaurus doc that I'd like to
confirm / understand.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter

I believe the following section is a bit misleading. I'm sure it's correct
for the case it describes, but I tested another case that seemed similar on
the surface and got different results. In hindsight that is not really a
conflict, just a surprise.

At the bottom of the gray Synonym file format box it shows the example:
    #multiple synonym mapping entries are merged.
    foo => foo bar
    foo => baz
    #is equivalent to
    foo => foo bar, baz
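
To make the merge rule in that example concrete, here is a minimal Python sketch. It is a toy model of the documented rule only, not Solr's parser; `parse_explicit` is a hypothetical helper, and each right-hand side is kept as a plain string rather than a token sequence.

```python
from collections import defaultdict

def parse_explicit(lines):
    """Toy parser for 'lhs => rhs1, rhs2' rules (an assumption, not Solr code)."""
    table = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        lhs, rhs = line.split("=>")
        key = lhs.strip()
        for target in rhs.split(","):
            target = target.strip()
            if target not in table[key]:
                table[key].append(target)  # later entries merge into earlier ones
    return dict(table)

separate = parse_explicit(["foo => foo bar", "foo => baz"])
combined = parse_explicit(["foo => foo bar, baz"])
print(separate == combined)
```

In this model both inputs produce the same table, which is all the "is equivalent to" comment in the wiki example claims.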

Whereas I was using non-explicit / reflexive mappings with overlapping
terms, for example:
    A, B, C, D
    A, E, I, O, U
(Assume these are real, multi-letter words; the word "a" is often removed as
a stop word, of course.)

Assuming expand="true", and reading the wiki, I would have thought the
groups would be merged, to be effectively:
    A, B, C, D, E, I, O, U

This is NOT the case, which is actually good in my opinion.

At index time, if an A is seen, it WILL be expanded to also include B, C, D
and E, I, O, U.  This is true even if A is not listed first.

However, if the indexer encounters B, it will ONLY be expanded with A, C and
D.  Similarly, E will be augmented with A, I, O and U.

I tested this by actually looking at the word index with Luke.
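
The behavior I observed can be modeled with a short Python sketch. This is only an illustration of the results reported above, under the assumption that each term maps to the union of the groups it appears in; `build_expansions` is a hypothetical helper, not part of Solr.

```python
def build_expansions(groups):
    """Map each term to the union of every synonym group that contains it."""
    table = {}
    for group in groups:
        for term in group:
            table.setdefault(term, set()).update(group)
    return table

groups = [["A", "B", "C", "D"], ["A", "E", "I", "O", "U"]]
table = build_expansions(groups)

print(sorted(table["A"]))  # A is in both groups, so it picks up both
print(sorted(table["B"]))  # B is only in the first group
print(sorted(table["E"]))  # E is only in the second group
```

Only A, the shared term, ends up with all eight entries; B and E stay within their own group, so the groups themselves are never merged.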

If you DID want the merged behavior, where D would expand to match all 8
letters, you can either:
1: Put the synonym filter in the pipeline twice, along with the remove
duplicates filter
OR
2: Use the synonym filter at both index and query time
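
Both options can be sketched in Python. This is a toy model, assuming each term expands to the union of the groups it appears in and ignoring token positions; `build_expansions` and `expand` are hypothetical helpers, not Solr APIs.

```python
def build_expansions(groups):
    """Map each term to the union of every synonym group that contains it."""
    table = {}
    for group in groups:
        for term in group:
            table.setdefault(term, set()).update(group)
    return table

def expand(tokens, table):
    """One synonym pass followed by duplicate removal (first occurrence wins)."""
    out = []
    for tok in tokens:
        for syn in sorted(table.get(tok, {tok})):
            if syn not in out:
                out.append(syn)
    return out

table = build_expansions([["A", "B", "C", "D"], ["A", "E", "I", "O", "U"]])

# Option 1: run the expansion twice in the same pipeline.
once = expand(["D"], table)
twice = expand(once, table)
print(once, twice)

# Option 2: expand at index time and again at query time.
indexed = set(expand(["D"], table))
query = set(expand(["E"], table))
print(indexed & query)  # the shared term links the two groups
```

In option 1 the second pass sees A and pulls in the other group. In option 2 a document containing D and a query for E both expand to include A, so they match even though neither side holds the full merged set.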

Does anybody disagree with this?

And what should be added to the Wiki doc?

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
