lucene-java-user mailing list archives

From "Joshua O'Madadhain" <>
Subject RE: Indexing synonyms
Date Mon, 11 Nov 2002 19:45:56 GMT
On Mon, 11 Nov 2002, Spencer, Dave wrote:

> Re "reducing the set of question/answer pair to consider" below - I
> would expect that using synonyms either in the index or in the
> reformed query would (annoyingly)  increase the number of potential
> matches or is there something I'm missing.

Generally, you're right.  

More formally, in the information retrieval community, 'recall' is defined
roughly as n/r, and 'precision' as n/p, where 
* n is the # of relevant articles returned in response to the query
* r is the total # of articles relevant to the query
* p is the total # of articles returned by the query.
(These are analogous to "completeness" and "correctness" of formal

So things that tend to increase recall tend to decrease precision, and
vice versa.
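
To make the definitions above concrete, here is a minimal sketch that computes both measures for a query before and after synonym expansion. The document ids and result sets are made up for illustration; only the n/r and n/p formulas come from the definitions above.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: set of document ids returned by the query (its size is p)
    relevant: set of document ids relevant to the query (its size is r)
    """
    n = len(returned & relevant)  # relevant documents actually returned
    precision = n / len(returned) if returned else 0.0
    recall = n / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection with 4 relevant documents.
relevant = {1, 2, 3, 4}
narrow = {1, 2, 9}                # results of the original query
expanded = {1, 2, 3, 9, 10, 11}   # results after adding synonyms

narrow_pr = precision_recall(narrow, relevant)
expanded_pr = precision_recall(expanded, relevant)
print(narrow_pr, expanded_pr)
```

In this toy case the expanded query finds one more relevant document (recall goes from 2/4 to 3/4) but also returns more irrelevant ones (precision goes from 2/3 to 3/6).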

Not coincidentally, one area of my research is in investigating methods
that increase recall (via query expansion) but do not significantly
adversely affect precision (by assigning weights to terms added to the
query according to their aggregated similarity to the query terms).  No
conclusive results yet, I'm afraid.
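
A rough sketch of the weighting idea, to show the shape of it rather than any actual research method: each candidate term gets a weight equal to its aggregated similarity to the query terms, and weak candidates are dropped. The similarity table, the use of a plain mean as the aggregate, and the 0.3 threshold are all assumptions made for illustration.

```python
def expand_query(query_terms, similarity, threshold=0.3):
    """Return {term: weight} for the original terms plus weighted additions.

    similarity maps (query_term, candidate) -> score in [0, 1]; missing
    pairs count as 0. Aggregating with a plain mean is an assumption.
    """
    if not query_terms:
        return {}
    weights = {t: 1.0 for t in query_terms}  # original terms keep full weight
    candidates = {c for (_, c) in similarity if c not in weights}
    for c in candidates:
        total = sum(similarity.get((q, c), 0.0) for q in query_terms)
        aggregated = total / len(query_terms)
        if aggregated >= threshold:  # drop weakly related candidates
            weights[c] = aggregated
    return weights


# Hypothetical term-term similarity scores.
sim = {
    ("car", "automobile"): 0.9,
    ("repair", "automobile"): 0.5,
    ("car", "train"): 0.4,
    ("repair", "fix"): 0.8,
}
weights = expand_query(["car", "repair"], sim)
print(weights)
```

Here 'automobile' is related to both query terms and gets a high weight, while 'train' is related to only one of them, weakly, and is left out. The resulting weights could then be applied as per-term boosts in the expanded query.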

> Interesting that this topic just came up as I wanted to experiment
> w/ the same thing. My first stab at a public domain synonym
> list, the "moby" list, didn't seem to have synonyms however. 
> I believe another poster mentioned WordNet so I'll try that.
> I'd really like it if it was possible to automatically determine
> synonyms - maybe something similar to Latent Semantic Analysis - but
> such things seem kinda hard to code up...

LSA does have some advantages, but it has problems as well (e.g., last I
checked, it was rather computationally expensive).  There are other
mechanisms for determining synonym-like relationships, however, such as
measuring term-term correlations in the corpus.  
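
As one possible reading of "term-term correlations", the sketch below scores two terms by how often they occur in the same documents, using the cosine of their document-incidence vectors. The tiny corpus, the whitespace tokenization, and the choice of cosine are assumptions; other correlation measures would fit the same outline.

```python
import math
from collections import defaultdict


def term_document_sets(docs):
    """Map each term to the set of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return postings


def cooccurrence_similarity(postings, a, b):
    """Cosine similarity of the binary document-incidence vectors of a and b."""
    docs_a = postings.get(a, set())
    docs_b = postings.get(b, set())
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / math.sqrt(len(docs_a) * len(docs_b))


# Hypothetical four-document corpus.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "train schedule",
]
postings = term_document_sets(docs)
print(cooccurrence_similarity(postings, "car", "automobile"))
print(cooccurrence_similarity(postings, "car", "train"))
```

Note that this particular measure is symmetric by construction, so on its own it cannot capture relationships that are stronger in one direction than the other.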

One thing to be careful of in this context is assuming that synonym
relationships are symmetric: the connection from 'hot' to 'radioactive' (or
'spicy', or 'attractive', or ...) is not nearly as strong as the
connections going the other direction.  You also can get problems with
homonyms like 'minute' (time period) and 'minute' (very small); clearly
these two demand different classes of related terms.
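
One way to see the asymmetry is to use a directed measure, for example an estimate of how likely the target term is to appear in a document given that the source term appears. This is only a sketch under that assumption, and the document sets below are invented to illustrate the 'hot' example.

```python
def directed_strength(postings, source, target):
    """Estimate P(target in document | source in document)."""
    docs_source = postings.get(source, set())
    if not docs_source:
        return 0.0
    docs_target = postings.get(target, set())
    return len(docs_source & docs_target) / len(docs_source)


# Hypothetical postings: term -> ids of the documents containing it.
postings = {
    "hot": {0, 1, 2, 3, 4, 5, 6, 7},
    "spicy": {0, 1},
    "radioactive": {2},
}

spicy_to_hot = directed_strength(postings, "spicy", "hot")
hot_to_spicy = directed_strength(postings, "hot", "spicy")
hot_to_radioactive = directed_strength(postings, "hot", "radioactive")
print(spicy_to_hot, hot_to_spicy, hot_to_radioactive)
```

With these made-up counts, 'spicy' points strongly at 'hot', while 'hot' points only weakly at 'spicy' or 'radioactive', because 'hot' appears in many documents that have nothing to do with either. Homonyms such as 'minute' are not handled by this at all; both senses would share one entry.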
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
