mahout-user mailing list archives

From DAN HELM <danielh...@verizon.net>
Subject Re: Using LDA in Mahout 0.0.7
Date Sun, 28 Oct 2012 21:40:53 GMT
Hi Diego,
A number of us hit the same issue when first working with the new CVB algorithm. The vector
keys for CVB need to be integers (IntWritable), whereas seq2sparse writes Text keys. You can
use the rowid utility to convert the output of seq2sparse to the form CVB needs, e.g.,
http://comments.gmane.org/gmane.comp.apache.mahout.user/13112 
Dan  
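
For readers unfamiliar with the rowid step, here is a minimal plain-Java sketch of the idea
behind it: each Text-keyed vector is given a sequential integer key, and a separate index
records which original key each integer stands for. This is only a conceptual stand-in. It
uses no Mahout or Hadoop classes, and the names RowIdSketch, Result, and assignRowIds are
hypothetical. As far as I know, the real utility writes a "matrix" file (integer keys to
vectors) and a "docIndex" file (integer keys to original keys) under its output directory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Conceptual sketch only: plain-Java stand-in for the key renumbering that rowid performs. */
public class RowIdSketch {

    /** Holds the int-keyed vectors plus an index back to the original text keys. */
    public static final class Result {
        public final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        public final Map<Integer, String> docIndex = new LinkedHashMap<>();
    }

    /**
     * Assigns sequential integer ids (0, 1, 2, ...) to the vectors in iteration order
     * and remembers which original key each id replaced.
     */
    public static Result assignRowIds(Map<String, double[]> vectorsByTextKey) {
        Result result = new Result();
        int nextId = 0;
        for (Map.Entry<String, double[]> entry : vectorsByTextKey.entrySet()) {
            result.matrix.put(nextId, entry.getValue());   // what CVB would consume
            result.docIndex.put(nextId, entry.getKey());   // lets you map topics back to documents
            nextId++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy term-frequency vectors keyed by a text id, as seq2sparse would key them.
        Map<String, double[]> tfVectors = new LinkedHashMap<>();
        tfVectors.put("tweet-a", new double[] {1, 0, 2});
        tfVectors.put("tweet-b", new double[] {0, 3, 0});

        Result result = assignRowIds(tfVectors);
        System.out.println(result.docIndex); // prints {0=tweet-a, 1=tweet-b}
    }
}
```

The point of keeping the index is that the per-document topic output from CVB is keyed by
the integer ids, so the index is what lets you recover which tweet each row refers to.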

________________________________
 From: Diego Ceccarelli <diego.ceccarelli@gmail.com>
To: user@mahout.apache.org 
Sent: Sunday, October 28, 2012 5:21 PM
Subject: Using LDA in Mahout 0.0.7
  
Dear all,

I'm trying to use the LDA framework in Mahout and I'm running into
some trouble.
I read these tutorials [1,2] and decided to apply LDA to a collection of
1M tweets to see how it works. I indexed them with Lucene as suggested
in [2]. Then I discovered that the latest version no longer supports this
and that I had to use a sequence file instead.
I saw the 'seqdirectory' util in [2], but it's a bit impractical to create one million
documents, each containing a single tweet. So I wrote a small Java app that takes a file
where each line is a document and creates a sequence file <Text,Text> containing the id
(line number) and the tweet.
Then I used the seq2sparse util:

./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o /tmp/vector -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

and it created the vectors without problems.

Next, I discovered that LDA is now called CVB (why did you change the name? It's
a bit confusing), so I tried to run that command, but I got this error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
(full stack trace here [3])

I also tried the local version:

./bin/mahout cvb0_local -i /tmp/vector/tf-vectors -d /tmp/vector/dictionary.file-0 --numTopics 100 --docOutputFile /tmp/out --topicOutputFile /tmp/topic

(Why are the parameter names different?)
But I got a similar error:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
(full stack trace here [4])

Where am I going wrong? Could you please help me?
Thanks 
Diego

[1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
[2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
[3] http://pastebin.com/nV3T74fe
[4] http://pastebin.com/JH1xQHuC