mahout-user mailing list archives

From Liz Merkhofer
Subject NaN in cvb topic models after lucene2seq
Date Wed, 24 Jul 2013 21:07:15 GMT
Hello list,

I'm having some problems using cvb (now that lda is deprecated) on my
Lucene (or Solr, if you will) index. I am using Mahout 0.8.

My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything seems to
work, but when I dump the topics with seqdumper they all come out as NaN, like:

Key class: class Value Class: class
Key: 0: Value:

... etc. I suspect my problem is in the output of lucene2seq, which is a
folder of 14 files named part-m-000xx that look very much like the
text in my Lucene index and nothing like the unreadable jumble I would get
from 'seqdirectory' on an actual directory of text files.
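
One quick way to test that suspicion is to check whether each part file begins with the three magic bytes "SEQ" that open every Hadoop SequenceFile. An uncompressed SequenceFile of text values can look largely readable, so readability alone does not show the files are wrong. The following is only a minimal sketch: the helper name is_seqfile is hypothetical, and it assumes the part files sit on the local filesystem rather than in HDFS.

```shell
# Hypothetical helper (not part of Mahout): succeeds if the given local
# file starts with the Hadoop SequenceFile magic bytes "SEQ".
is_seqfile() {
  [ "$(head -c 3 "$1")" = "SEQ" ]
}

# Example use, assuming the lucene2seq output is on the local filesystem:
#   for f in /tmp/cvb/lucene2seqout/part-m-*; do
#     if is_seqfile "$f"; then echo "$f: SequenceFile"; else echo "$f: not a SequenceFile"; fi
#   done
```

If every part file passes this check, the lucene2seq output is at least in the right container format, and the NaN values more likely originate in a later step.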

If it helps, here's how I'm doing this:

./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr
data>index -id docId -f textbody_en

./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout
--namedVector --maxDFPercent 70 --weight TF -n 2 -a

./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout

./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict
/tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt

Any thoughts?

Thank you,
