lucene-java-user mailing list archives

From Steve Rowe
Subject Re: Document-Document similarity
Date Tue, 07 Oct 2003 19:00:09 GMT

Why not perform document-as-query?  That is, parse a document to 
produce a query, submit the query, and get a list of documents ranked 
by similarity.
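
As a rough illustration of the document-as-query idea, the sketch below turns a document's text into a query string by keeping its most frequent terms. It uses only the Java standard library: the tokenizer is a crude stand-in for a real analyzer, and the class and method names (DocumentAsQuery, topTerms, toQueryString) are hypothetical, not part of the Lucene API. The resulting string would still have to be handed to whatever query parser you use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DocumentAsQuery {
    /** Counts lowercase alphanumeric tokens; a stand-in for a real analyzer. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Returns up to maxTerms terms, most frequent first, ties broken alphabetically. */
    static List<String> topTerms(String text, int maxTerms) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(termFrequencies(text).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < Math.min(maxTerms, entries.size()); i++) {
            terms.add(entries.get(i).getKey());
        }
        return terms;
    }

    /** Joins the selected terms with spaces to form a simple keyword query. */
    static String toQueryString(String text, int maxTerms) {
        return String.join(" ", topTerms(text, maxTerms));
    }
}
```

Capping the number of terms is a design choice made here to keep the query short; a long document submitted whole would produce a very large query.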

Are you trying to do clustering?  Write a custom analyzer which saves 
the analysis of each document as it's parsed for the indexing process, 
then iterate through all of the documents, submit each as a query, and 
collect the results.
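
The loop described above might look roughly like the following. This is a sketch under stated assumptions, not Lucene code: analyze() stands in for the analysis saved at indexing time, score() uses plain Jaccard overlap as a placeholder for the score a real search would return, and all names (DocumentNeighbours, rankAll) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DocumentNeighbours {
    /** Stand-in for the saved analysis of a document: its set of lowercase tokens. */
    static Set<String> analyze(String text) {
        Set<String> terms =
                new HashSet<>(Arrays.asList(text.toLowerCase().split("[^a-z0-9]+")));
        terms.remove("");
        return terms;
    }

    /** Jaccard overlap; an assumed placeholder for a search engine's score. */
    static double score(Set<String> query, Set<String> doc) {
        Set<String> shared = new HashSet<>(query);
        shared.retainAll(doc);
        int union = query.size() + doc.size() - shared.size();
        return union == 0 ? 0.0 : (double) shared.size() / union;
    }

    /** For every document id, lists the other ids ranked by descending score. */
    static Map<String, List<String>> rankAll(Map<String, String> docs) {
        // Analyze each document once and keep the result, as the custom analyzer would.
        Map<String, Set<String>> cache = new LinkedHashMap<>();
        docs.forEach((id, text) -> cache.put(id, analyze(text)));

        Map<String, List<String>> ranked = new LinkedHashMap<>();
        for (String queryId : cache.keySet()) {
            List<String> others = new ArrayList<>(cache.keySet());
            others.remove(queryId);
            others.sort((a, b) -> Double.compare(
                    score(cache.get(queryId), cache.get(b)),
                    score(cache.get(queryId), cache.get(a))));
            ranked.put(queryId, others);
        }
        return ranked;
    }
}
```

The ranked lists are only the input to clustering; grouping the documents from them is a separate step not shown here.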

Or pseudo-relevance feedback?  Re-parse the top N documents resulting 
from a given query, bundle up the results as another query, then 
recombine the scores after you weight the components (Rocchio's 
formula; the full thing also involves a negatively reinforcing 
component -- re-parse the bottom M documents resulting from the 
initial query, package as another query, then use a negative weight 
when combining with other components' scores -- but this step doesn't 
seem to contribute positively in a reliable fashion to the overall 
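
For reference, the textbook form of Rocchio's formula works on term-weight vectors rather than on the scores of separate queries: the new query is alpha times the original query, plus beta times the average of the top documents, minus gamma times the average of the bottom documents. The sketch below implements that vector form in plain Java. The names (Rocchio, expand, addCentroid) and the map-based vector representation are assumptions for illustration, and the weights are left as parameters because the message does not give values.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Rocchio {
    /** Adds weight * (average of docs) into target, term by term. */
    static void addCentroid(Map<String, Double> target,
                            List<Map<String, Double>> docs,
                            double weight) {
        if (docs.isEmpty()) {
            return;
        }
        for (Map<String, Double> doc : docs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                target.merge(e.getKey(), weight * e.getValue() / docs.size(), Double::sum);
            }
        }
    }

    /** Returns alpha*query + beta*centroid(topDocs) - gamma*centroid(bottomDocs). */
    static Map<String, Double> expand(Map<String, Double> query,
                                      List<Map<String, Double>> topDocs,
                                      List<Map<String, Double>> bottomDocs,
                                      double alpha, double beta, double gamma) {
        Map<String, Double> result = new HashMap<>();
        query.forEach((term, w) -> result.put(term, alpha * w));
        addCentroid(result, topDocs, beta);
        addCentroid(result, bottomDocs, -gamma);
        return result;
    }
}
```

Passing gamma as 0 drops the negative component entirely, which matches the simpler variant described first. Terms can end up with negative weights when gamma is non-zero; this sketch leaves them in the result rather than clipping them.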

Steve Rowe

Maurice Coyle wrote:
> does anyone know of a way to get the similarity between two documents as
> opposed to between a document and a query?  at the moment, i'm forced to
> make a term-frequency vector for each document and get the cosine of the
> angle between them, but i was hoping there was a more elegant way of doing
> this using either the lucene api (although from my study of it it doesn't
> look like this is the case) or some other class library that another lucene
> user has created.
> any help much appreciated.
> maurice
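
For readers unfamiliar with the term-frequency approach Maurice describes, here is a minimal sketch of it in plain Java: count the terms in each document, then take the cosine of the angle between the two count vectors. The tokenizer is a simplistic assumption and the names (CosineSimilarity, termFrequencies, cosine) are hypothetical; nothing here uses the Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSimilarity {
    /** Counts lowercase alphanumeric tokens; a stand-in for a real analyzer. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> v) {
        double sumSquares = 0.0;
        for (int count : v.values()) {
            sumSquares += (double) count * count;
        }
        return Math.sqrt(sumSquares);
    }

    /** Cosine of the angle between two vectors; 0.0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }
}
```

A call such as CosineSimilarity.cosine(CosineSimilarity.termFrequencies(textA), CosineSimilarity.termFrequencies(textB)) yields a value between 0 and 1 for raw counts. This version uses raw counts only; weighting terms by how rare they are across the collection is a common refinement not shown here.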
