mahout-user mailing list archives

From Andrew Clegg <andrew.clegg+mah...@gmail.com>
Subject Calculating cosine similarity for vectors extracted from Lucene
Date Sun, 12 Jun 2011 00:29:37 GMT
Hi,

I extracted the contents of a Lucene index like so:

bin/mahout lucene.vector --dir /path/to/index/ --output \
  /path/to/vectors --dictOut /path/to/dict --field text --idField id \
  --weight TFIDF --maxDFPercent 90 --minDF 10

And then I tried to get the cosine similarity between the docs like so:

bin/mahout rowsimilarity -i /path/to/vectors -o /path/to/25nn-matrix \
  -s SIMILARITY_UNCENTERED_COSINE -m 25 -r 10000000

But I got this:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
	at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob$RowWeightMapper.map(RowSimilarityJob.java:198)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

I assume this refers to the document ID or something -- since the
actual tf.idf scores will be doubles, right? Is there an easy way to
convert these on the fly, or do I need to write something to do it?
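
As a rough illustration of what such a conversion might involve (a sketch only, not taken from Mahout): if the keys really are long document IDs, each one would have to be narrowed to an int, and it seems safer to fail loudly than to truncate silently. The class and method names below are hypothetical, and reading and rewriting the actual sequence file is left out.

```java
public class KeyNarrowing {
    // Hypothetical helper: narrow a long key to an int,
    // refusing any value that does not fit.
    static int toIntKey(long key) {
        // Math.toIntExact throws ArithmeticException on overflow
        // instead of silently truncating the value.
        return Math.toIntExact(key);
    }

    public static void main(String[] args) {
        System.out.println(toIntKey(42L));
        try {
            toIntKey(1L << 40); // too large for an int
        } catch (ArithmeticException e) {
            System.out.println("key out of int range");
        }
    }
}
```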

Also, another (somewhat unrelated) question... The -r param to
rowsimilarity specifies "Number of columns in the input matrix".
What's the recommended approach when you don't know this in advance?
Just set it much higher than you'll need (as I did above)?
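
If the vectors can be scanned once beforehand, the column count could in principle be derived rather than guessed: it would be one more than the largest term index that occurs in any row. A minimal sketch, assuming zero-based indices and assuming each row is available as a map from column index to weight (the class and method names are made up for illustration, and this does not read Mahout's vector files):

```java
import java.util.List;
import java.util.Map;

public class ColumnCount {
    // Hypothetical helper: number of columns needed to hold
    // every column index seen in the given sparse rows.
    static int numColumns(List<Map<Integer, Double>> rows) {
        int maxIndex = -1;
        for (Map<Integer, Double> row : rows) {
            for (int index : row.keySet()) {
                maxIndex = Math.max(maxIndex, index);
            }
        }
        return maxIndex + 1; // assumes indices are zero-based
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = List.of(
            Map.of(0, 0.5, 7, 1.2),
            Map.of(3, 2.0));
        System.out.println(numColumns(rows)); // prints 8
    }
}
```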

Many thanks from a Mahout noob!

Andrew.

-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
