mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Calculating cosine similarity for vectors extracted from Lucene
Date Sun, 12 Jun 2011 10:46:27 GMT
Hi Andrew,

You're right, the key of the data for RowSimilarityJob needs to be an 
IntWritable. The number of columns (which is the number of distinct 
terms in your case) is only relevant if you use loglikelihood ratio as 
the similarity measure; it is ignored when you use cosine.
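
To illustrate why the column count does not matter for cosine, here is a minimal standalone sketch. It is not Mahout's implementation; the class name SparseCosine and the use of plain Java maps as sparse vectors are assumptions made for illustration. The similarity is computed only from the non-zero entries, so the total number of columns never enters the formula.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of Mahout.
class SparseCosine {

    // Uncentered cosine similarity of two sparse vectors stored as
    // (column index -> weight) maps. Only non-zero entries are visited.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Iterate over the smaller map and probe the larger one.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;

        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }

        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for empty vectors
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<Integer, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Two documents sharing terms 3 and 9000000; no column count is supplied.
        Map<Integer, Double> doc1 = new HashMap<>();
        doc1.put(3, 1.0);
        doc1.put(9_000_000, 2.0);

        Map<Integer, Double> doc2 = new HashMap<>();
        doc2.put(3, 2.0);
        doc2.put(9_000_000, 4.0);

        System.out.println(cosine(doc1, doc2));
    }
}
```

Because doc2 is a scaled copy of doc1, the result is 1 up to floating-point rounding, whatever upper bound you might have passed for the number of columns.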

I think you would need to write some wrapping code. A look at 
ItemSimilarityJob might help: it uses RowSimilarityJob to calculate 
item-item similarities in Collaborative Filtering.
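
As a rough sketch of the key-handling part of such wrapping code, the following standalone Java shows two ways to turn long keys into int row indices. The class RowKeyConverter and its methods are hypothetical, and the Hadoop wiring (for example a map step that reads LongWritable keys and emits IntWritable keys with the same vectors) is deliberately left out, so treat this as an outline of the logic only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout or Hadoop.
class RowKeyConverter {

    // long key -> sequential int row index, in order of first appearance.
    private final Map<Long, Integer> assigned = new LinkedHashMap<>();

    // Option 1: checked narrowing. Works when every key already fits
    // in an int; throws ArithmeticException instead of silently wrapping.
    static int narrow(long key) {
        return Math.toIntExact(key);
    }

    // Option 2: assign fresh sequential indices and remember the mapping,
    // so results can be translated back to the original keys later.
    // Note: this in-memory map only works in a single process; a
    // distributed job would need a shared dictionary instead.
    int indexFor(long key) {
        Integer index = assigned.get(key);
        if (index == null) {
            index = assigned.size();
            assigned.put(key, index);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(narrow(42L));

        RowKeyConverter converter = new RowKeyConverter();
        System.out.println(converter.indexFor(5_000_000_000L));
        System.out.println(converter.indexFor(7L));
        System.out.println(converter.indexFor(5_000_000_000L));
    }
}
```

Checked narrowing is the simpler choice if the keys are small sequential numbers; the dictionary approach costs memory but handles arbitrary long keys.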

Having a "DocumentSimilarityJob" in Mahout would be a cool thing. Maybe 
you'd like to share your code afterwards?

--sebastian


On 12.06.2011 02:29, Andrew Clegg wrote:
> Hi,
>
> I extracted the contents of a Lucene index like so:
>
> bin/mahout lucene.vector --dir /path/to/index/ --output
> /path/to/vectors --dictOut /path/to/dict --field text --idField id
> --weight TFIDF --maxDFPercent 90 --minDF 10
>
> And then I tried to get the cosine similarity between the docs like so:
>
> bin/mahout rowsimilarity -i /path/to/vectors -o /path/to/25nn-matrix
> -s SIMILARITY_UNCENTERED_COSINE -m 25 -r 10000000
>
> But I got this:
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> be cast to org.apache.hadoop.io.IntWritable
> 	at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob$RowWeightMapper.map(RowSimilarityJob.java:198)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> I assume this refers to the document ID or something -- since the
> actual tf.idf scores will be doubles, right? Is there an easy way to
> convert these on the fly, or do I need to write something to do it?
>
> Also, another (somewhat unrelated) question... The -r param to
> rowsimilarity specifies "Number of columns in the input matrix".
> What's the recommended approach when you don't know this in advance?
> Just set it much higher than you'll need (as I did above)?
>
> Many thanks from a Mahout noob!
>
> Andrew.
>

