mahout-user mailing list archives

From Liz Merkhofer <lmerkho...@bericotechnologies.com>
Subject NaN in cvb topic models after lucene2seq
Date Wed, 24 Jul 2013 21:07:15 GMT
Hello list,

I'm having some problems using cvb (now that lda is deprecated) on my
Lucene (or Solr, if you will) index. I am using Mahout 0.8.

My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything seems to
work, but when I inspect the resulting topics with seqdumper, every value
comes out as NaN, like:

Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value:
{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,

... etc. I suspect my problem is in the output of lucene2seq, which is a
folder of 14 files called /part-m-000xx. They look very much like the
text in my Lucene index and nothing like the unreadable jumble I would get
from 'seqdirectory' on an actual directory of text files.

If it helps, here's how I'm doing this:

./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr
data>index -id docId -f textbody_en

./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout
--namedVector --maxDFPercent 70 --weight TF -n 2 -a
org.apache.lucene.analysis.core.WhitespaceAnalyzer

./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout

./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict
/tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt
/tmp/cvb/model
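
To narrow down which stage first produces NaN, the text that seqdumper
prints for each stage's output could be scanned with a small script. What
follows is only a rough sketch, not part of Mahout: the function name
count_nan is made up for illustration, and the "index:value" layout is
assumed from the dump shown earlier in this message.

```python
import re

# Matches one "index:value" pair in seqdumper's text rendering of a vector,
# e.g. "0:NaN" or "17:0.0034". The layout is an assumption based on the
# dump quoted in this message, not a documented format.
PAIR = re.compile(r"(\d+):([^,{}\s]+)")


def count_nan(line):
    """Return (nan_count, total_count) for one dumped vector line."""
    values = [value for _, value in PAIR.findall(line)]
    nan_count = sum(1 for value in values if value.lower() == "nan")
    return nan_count, len(values)


if __name__ == "__main__":
    # Hypothetical sample line, truncated the same way as the dump above.
    sample = "{0:NaN,1:NaN,2:0.5,3:NaN,"
    print(count_nan(sample))
```

Running this over dumps of the tf-vectors, the rowid matrix and the cvb
output in turn would show whether the vectors are already empty or NaN
before cvb runs, or only afterwards.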

Any thoughts?

Thank you,
Liz
