lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: A simple Vector Space Model and TFIDF usage
Date Tue, 30 Jun 2009 16:13:21 GMT

On Jun 29, 2009, at 3:10 PM, Amir Hossein Jadidinejad wrote:

> Hi,
> It's my first experiment with Lucene. Please help me.
> I'm going to index a set of documents and create a feature vector
> for each of them. Each vector contains all the terms that belong to
> the document, weighted using TF-IDF.
> After that I want to compute the cosine similarity between all  
> documents and produce a doc-doc similarity matrix. My document set  
> is large and it's important to have a scalable implementation.

See Mahout. In its utils module there is a class called LuceneIterable
that the o.a.mahout.utils.vectors.Driver program can use to convert a
Lucene index into a Mahout Vector representation, which can then be
used to create a doc-doc similarity matrix. It uses Hadoop, so you can
go as big as you want.
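
To make the computation itself concrete, here is a minimal in-memory
sketch of TF-IDF weighting followed by a cosine doc-doc similarity
matrix. It is not Mahout or Lucene code: the class name TfIdfCosine,
its method names, and the weighting tf * (1 + ln(N / df)) are all
assumptions chosen for illustration, and the all-pairs loop is
quadratic in the number of documents, so it only suits small sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory TF-IDF and cosine similarity (hypothetical helper, not a Lucene/Mahout API). */
class TfIdfCosine {

    /** Builds one sparse TF-IDF vector (term -> weight) per tokenized document. */
    static List<Map<String, Double>> tfidf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();           // document frequency per term
        List<Map<String, Integer>> counts = new ArrayList<>(); // raw term counts per document
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
            counts.add(tf);
        }
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Assumed weighting: raw tf * (1 + ln(N / df)).
                double idf = 1.0 + Math.log((double) n / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two sparse vectors; 0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    /** Symmetric doc-doc matrix; only the upper triangle is computed. */
    static double[][] similarityMatrix(List<Map<String, Double>> vectors) {
        int n = vectors.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "search"),
            Arrays.asList("lucene", "index", "vector"),
            Arrays.asList("hadoop", "cluster"));
        for (double[] row : similarityMatrix(tfidf(docs))) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Documents that share no terms get a similarity of 0, and each document
has similarity 1 with itself. For a large collection, the same idea is
what the Hadoop-based route above is meant to distribute.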



Grant Ingersoll
