lucene-java-user mailing list archives

From Fotis P <fotis...@gmail.com>
Subject Computing the similarity of documents
Date Thu, 21 May 2015 14:49:41 GMT
Hello everyone,

My task at hand is to compute the pairwise cosine similarity between a list
of documents.
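
Concretely, the target value can be sketched as follows. This is a minimal illustration that assumes raw term-frequency vectors held in plain maps and whitespace tokenisation; the class and method names (CosineSketch, termFreqs, cosine) are hypothetical and are not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    // Builds a raw term-frequency vector. Splitting on whitespace is a
    // stand-in assumption; a real setup would reuse the index analyzer.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sumSq = 0.0;
        for (int f : v.values()) {
            sumSq += (double) f * f;
        }
        return Math.sqrt(sumSq);
    }

    // Cosine similarity: dot product over shared terms, divided by the
    // product of both vector lengths. Returns 0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFreqs("lucene scoring lucene");
        Map<String, Integer> d2 = termFreqs("lucene similarity");
        System.out.println(cosine(d1, d2));
    }
}
```

The value is symmetric in its two arguments and equals 1 when a document is compared with itself, which gives two quick sanity checks for any score obtained from the index.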

I first index all the documents with the DOCS_AND_FREQS option, then
construct a query from every term of a document:

Query query = parser.parse(document);

making sure to use the same analyzer at indexing and search time.

I have also implemented my own similarity class so that I exclude coord(),
sloppyFreq() etc. My implementation is here: http://pastebin.com/MArCs3ff

I still don't get the correct results, however. The scores do make sense
from a search perspective, but they are not the values that I am
looking for.

I am a bit lost as to what I should change to fine-tune the behaviour exactly
as I want it. The Lucene scoring formula, for example, confuses me with this
part: Σ tf(t in d)
<http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html#formula_tf>
As I read it, this means that it only takes into account terms that exist in
the query (in my case, a document). Terms that exist in the other document
but not in the query do not alter the results, correct?
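
To put the question in symbols (a sketch only; q_t and d_t are notation assumed here for the weight of term t in the query document and in the indexed document, not Lucene identifiers), the cosine similarity is:

```latex
\cos(q, d) =
  \frac{\sum_{t \in q \cap d} q_t \, d_t}
       {\sqrt{\sum_{t \in q} q_t^{2}} \; \sqrt{\sum_{t \in d} d_t^{2}}}
```

Terms that occur only in d add nothing to the numerator, but they do enter the second factor of the denominator. A sum restricted to the query's terms would therefore presumably match this value only if the per-document normalisation also covers the terms that are missing from the query.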

I hope what I am asking for is clear enough. If you need some more
information from me please ask.

Thank you in advance,

Fotios
