mahout-user mailing list archives

From Diego Ceccarelli <diego.ceccare...@gmail.com>
Subject Using LDA in Mahout 0.0.7
Date Sun, 28 Oct 2012 21:21:17 GMT
Dear all,

I'm trying to use the LDA framework in Mahout and I'm running into
some trouble.
I read these tutorials [1,2] and decided to apply LDA to a collection of
1M tweets to see how it works. I indexed them with Lucene as suggested
in [2]. Then I discovered that this is not supported in the latest version
and that I had to use a sequence file instead.
I saw the 'seqdirectory' utility in [2], but it is impractical to create one million
files, each containing a single tweet. So I wrote a small Java app that takes a file
where each line is a document and creates a sequence file <Text,Text> containing the
id (line number) and the tweet.
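
A minimal sketch of the line-to-record step described above, not the actual app. The class name TweetRecords, the method fromLines, and the 0-based line numbering are assumptions made for illustration. The Hadoop part is left out so that the sketch runs with the JDK alone: in a real converter each pair would presumably be wrapped as Text/Text and appended with Hadoop's SequenceFile.Writer.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns the lines of a file into (id, text) records.
final class TweetRecords {

    // Pairs every line with its 0-based line number, rendered as a string key.
    // The 0-based numbering is an assumption; the original app may differ.
    static List<Map.Entry<String, String>> fromLines(List<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String id = Integer.toString(i);
            String tweet = lines.get(i);
            // A real converter would write this pair to the sequence file here,
            // as a Text key and a Text value, instead of collecting it in memory.
            records.add(new SimpleImmutableEntry<>(id, tweet));
        }
        return records;
    }
}
```

Calling TweetRecords.fromLines on the lines of the input file yields one record per tweet, keyed by line number. The keys are strings, which matches the <Text,Text> layout mentioned above.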
Then I used the seq2sparse utility:

./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o /tmp/vector -wt tf \
  -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

This created the vectors without problems.

Next, I discovered that LDA is now called cvb (why was the name changed? It is
a bit confusing), so I tried to run that command, but I got this error:
 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
(full stack trace here [3])

I also tried the local version:

./bin/mahout cvb0_local -i /tmp/vector/tf-vectors -d /tmp/vector/dictionary.file-0 \
  --numTopics 100 --docOutputFile /tmp/out --topicOutputFile /tmp/topic

(Why are the parameter names different?)
But I got a similar error:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
(full stack trace here [4])

What am I doing wrong? Could you please help me?
Thanks 
Diego

[1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
[2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
[3] http://pastebin.com/nV3T74fe
[4] http://pastebin.com/JH1xQHuC