lucene-java-user mailing list archives

From "Spencer, Dave" <>
Subject RE: Indexing synonyms
Date Mon, 11 Nov 2002 19:01:06 GMT
Re "reducing the set of question/answer pairs to
consider" below - I would expect that using synonyms, either
in the index or in the reformed query, would (annoyingly)
increase the number of potential matches. Or is there
something I'm missing?

Interesting that this topic just came up, as I wanted to experiment
w/ the same thing. My first stab at a public domain synonym
list, the "moby" list, didn't seem to have synonyms, however.
I believe another poster mentioned WordNet so I'll try that.

I'd really like it if it were possible to automatically determine
synonyms - maybe something similar to Latent Semantic Analysis - but
such things seem kinda hard to code up...

-----Original Message-----
From: Aaron Galea []
Sent: Sunday, November 10, 2002 4:18 PM
To: Lucene Users List;
Subject: Re: Indexing synonyms

Thanks for all your replies,

Well, I will start off with an idea of what I am trying to achieve. I am
building a question answering system, and one of its modules is an FAQ
module. Since the QA system is concerned with education, users can
concentrate their question on a particular subject, reducing the set of
question/answer pairs to consider. Since there is this hierarchical
indexing, the index files are not that big, so I can store synonyms for
each word in a question in the index. Query expansion would solve the
problem and eliminate the need to store synonyms in the index, but this
will slow things down as there is no depth to consider for term
expansion. It is not my intention to build something similar to the
FAQFinder system, but I want to further reduce the subset of questions
to consider, on which a question reformulation algorithm would be
applied. Therefore the idea is to get an FAQ file dealing with one
subject, index all of its questions and expand each term in the index.
Using Lucene I will retrieve the questions that are likely to be similar
to a user question, select say the top 5, and apply a query
reformulation algorithm. If this succeeds, fine, and I return the answer
to the user; otherwise I submit the question to an answer extraction
module. The most important thing is speed, so putting term expansion in
the index hopefully should improve things. Obviously problems arise with
this method as there is no word disambiguation, but the query
reformulation algorithm will solve this. However it is slow, so I must
reduce the number of questions it is applied on. It is a tradeoff!!!
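
For comparison, the query-expansion alternative mentioned above could look
roughly like the sketch below. It is only an illustration, not code from
this thread: the class name, the expand() helper and the in-memory synonym
map are hypothetical stand-ins for a real WordNet lookup. The output uses
the AND/OR/parentheses syntax that Lucene's query parser accepts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: rewrites a user query so each term is OR-ed with its synonyms.
final class QueryExpansionSketch {

    /** Expands every whitespace-separated term into "(term OR syn1 OR ...)". */
    static String expand(String userQuery, Map<String, List<String>> synonyms) {
        List<String> groups = new ArrayList<>();
        for (String term : userQuery.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // an empty query yields a single empty string from split()
            }
            List<String> alternatives = new ArrayList<>();
            alternatives.add(term);
            alternatives.addAll(synonyms.getOrDefault(term, Collections.emptyList()));
            // Terms without synonyms are left alone; others become an OR group.
            groups.add(alternatives.size() == 1
                    ? term
                    : "(" + String.join(" OR ", alternatives) + ")");
        }
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        // Assumed synonym data, standing in for a WordNet lookup.
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("car", Arrays.asList("automobile", "auto"));
        System.out.println(expand("fast car", synonyms));
        // prints: fast AND (car OR automobile OR auto)
    }
}
```

Each OR group can only widen what a term matches, so the work saved at
indexing time is paid for at query time.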

Well I managed to solve this by overriding the next() method, and when it
gets to an EOS I start returning the new expanded terms that I stored
in a list.
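
A minimal sketch of that approach, under stated assumptions: the
TokenStream interface below is a simplified stand-in (plain strings, null
at end of stream) rather than the real Lucene classes, and the class names
and the synonym map are hypothetical. The filter passes tokens through
untouched, queues their synonyms, and hands the queue out once the input is
exhausted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class AppendSynonymsSketch {

    /** Simplified stand-in for a token stream: next() returns null at end of stream. */
    interface TokenStream {
        String next();
    }

    /** Source stream over a fixed list of terms. */
    static TokenStream of(String... terms) {
        Iterator<String> it = Arrays.asList(terms).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Passes tokens through, then emits the queued synonyms after the input ends. */
    static final class AppendSynonymsFilter implements TokenStream {
        private final TokenStream input;
        private final Map<String, List<String>> synonyms;
        private final Deque<String> pending = new ArrayDeque<>();

        AppendSynonymsFilter(TokenStream input, Map<String, List<String>> synonyms) {
            this.input = input;
            this.synonyms = synonyms;
        }

        @Override
        public String next() {
            String token = input.next();
            if (token != null) {
                // Remember the expansions for later and return the original token now.
                pending.addAll(synonyms.getOrDefault(token, Collections.emptyList()));
                return token;
            }
            // Input exhausted: drain the queue; poll() returns null once it is empty.
            return pending.poll();
        }
    }

    /** Reads a stream to the end and collects its tokens. */
    static List<String> drain(TokenStream stream) {
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("quick", Arrays.asList("fast"));
        synonyms.put("car", Arrays.asList("automobile", "auto"));
        System.out.println(drain(new AppendSynonymsFilter(of("quick", "car"), synonyms)));
        // prints: [quick, car, fast, automobile, auto]
    }
}
```

Because the synonyms arrive after the last real token, they are searchable
as single terms but do not sit next to the words they expand, so phrase
queries will not match on them.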

Thanks everyone for your reply!!!!


NB : And yep I am a Malteser Otis ! :)

----- Original Message -----
From: "Alex Murzaku" <>
To: "'Lucene Users List'" <>
Sent: Monday, November 11, 2002 12:17 AM
Subject: RE: Indexing synonyms

> You could also do something with org.apache.lucene.analysis.Token, which
> includes the following self-explanatory note:
>   /** Set the position increment.  This determines the position of this token
>    * relative to the previous Token in a {@link TokenStream}, used in phrase
>    * searching.
>    *
>    * <p>The default value is one.
>    *
>    * <p>Some common uses for this are:<ul>
>    *
>    * <li>Set it to zero to put multiple terms in the same position.  This is
>    * useful if, e.g., a word has multiple stems.  Searches for phrases
>    * including either stem will match.  In this case, all but the first stem's
>    * increment should be set to zero: the increment of the first instance
>    * should be one.  Repeating a token with an increment of zero can also be
>    * used to boost the scores of matches on that token.
>    *
>    * <li>Set it to values greater than one to inhibit exact phrase matches.
>    * If, for example, one does not want phrases to match across removed stop
>    * words, then one could build a stop word filter that removes stop words and
>    * also sets the increment to the number of stop words removed before each
>    * non-stop word.  Then exact phrase queries will only match when the terms
>    * occur with no intervening stop words.
>    *
>    * </ul>
>    * @see TermPositions
>    */
>   public void setPositionIncrement(int positionIncrement) {
>     if (positionIncrement < 0)
>       throw new IllegalArgumentException
>         ("Increment must be positive: " + positionIncrement);
>     this.positionIncrement = positionIncrement;
>   }
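
To illustrate the "set it to zero" case from the note above, here is a
small sketch that stacks synonyms on the same position as the original
word. It is an assumption-laden illustration: the Token class is a
simplified stand-in that only mirrors the increment check quoted above,
and withSynonyms(), describe() and the synonym map are hypothetical
helpers, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SamePositionSynonymsSketch {

    /** Simplified stand-in for a token: a term plus a position increment (default 1). */
    static final class Token {
        final String term;
        private int positionIncrement = 1;

        Token(String term) {
            this.term = term;
        }

        void setPositionIncrement(int positionIncrement) {
            if (positionIncrement < 0) {
                throw new IllegalArgumentException(
                        "Increment must be zero or greater: " + positionIncrement);
            }
            this.positionIncrement = positionIncrement;
        }

        int getPositionIncrement() {
            return positionIncrement;
        }
    }

    /** Emits each term with increment 1, followed by its synonyms with increment 0. */
    static List<Token> withSynonyms(List<String> terms, Map<String, List<String>> synonyms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term));
            for (String synonym : synonyms.getOrDefault(term, Collections.emptyList())) {
                Token token = new Token(synonym);
                token.setPositionIncrement(0); // same position as the original term
                out.add(token);
            }
        }
        return out;
    }

    /** Turns increments into absolute positions, formatted as "term@position". */
    static List<String> describe(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first token (increment 1) lands on position 0
        for (Token token : tokens) {
            position += token.getPositionIncrement();
            out.add(token.term + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("car", Arrays.asList("automobile"));
        System.out.println(describe(withSynonyms(Arrays.asList("fast", "car"), synonyms)));
        // prints: [fast@0, car@1, automobile@1]
    }
}
```

With "car" and "automobile" sharing position 1, a phrase search for either
"fast car" or "fast automobile" would line up against the same indexed
positions.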
> --
> Alex Murzaku
> ___________________________________________
>  alex(at)
> -----Original Message-----
> From: Otis Gospodnetic []
> Sent: Sunday, November 10, 2002 1:30 PM
> To: Lucene Users List
> Subject: Re: Indexing synonyms
> .mt?  Malta?  That's rare! :)
> A person called Clemens Marschner just submitted diffs for query
> rewriting to lucene-dev list 1-2 weeks ago.  The diffs are not in CVS
> yet, and they are a bit old now because the code they were made against
> has changed since they were made. You could either try applying them
> yourself, or wait until they get applied and then you could get a
> nightly build.
> Otis
> --- Aaron Galea <> wrote:
> > Hi everyone,
> >
> > I need to create a filter that extends a tokenfilter whose purpose is
> > to generate some synonyms for words in the document using Wordnet.
> > Well searching for synonyms using wordnet is not that problematic but
> > I need to add the synonym words to Lucene tokenstream before they are
> > passed for indexing. However TokenStream class does not support any
> > add method. Did anyone ever need to do this? Can someone suggest an
> > alternative of how to add some synonym words to the index?
> >
> > Thanks
> > Aaron
> >
> --
> To unsubscribe, e-mail:
> <>
> For additional commands, e-mail:
> <>
> ---
> [This E-mail was scanned for spam and viruses by]
