mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: NaN in cvb topic models after lucene2seq
Date Thu, 25 Jul 2013 16:39:57 GMT
On Thu, Jul 25, 2013 at 9:07 AM, Liz Merkhofer <lmerkhofer@bericotechnologies.com> wrote:

> Thanks so much for your response, Suneel.
>
> Unfortunately, the Solr index is not mine to post. But short of that, are
> there any useful answers I can provide? At the time I ran this, it
> contained 70,000 documents... I'm adding several times that today, though.
>
> I tried lucene2seq again.
>
> Running with the MapReduce default, the directory it creates contains
> _SUCCESS part-m-00003 part-m-00007 part-m-00011
> part-m-00000 part-m-00004 part-m-00008 part-m-00012
> part-m-00001 part-m-00005 part-m-00009 part-m-00013
> part-m-00002 part-m-00006 part-m-00010 part-m-00014
>
> With -xm sequential, however, it creates only "index."
>
> Looking at part-m-00014 or index, I see about the same thing: a header like
>
>
> SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@ua<80>yäØQõ-ãe<93>n5<9d>¡^@^@^C)^@^@^@^A^@<8e>^C%(
>
> And then the concatenated text of (all?) my documents
>
>
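This is definitely the problem: rowid wrote out a matrix with a single row,
so the whole index appears to have been treated as one document: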


> When I run "rowid," I get
>
> 13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix with 1 rows and
> 465540 columns to /tmp/cvb/rowidout/matrix
>

>
> In comparison, I'm working off the closest example I could find, from the
> book Hadoop MapReduce Cookbook (page in Safari Books Online:
> http://goo.gl/n3YVCz). Running seqdirectory on their sample, a directory
> containing data from 20 newsgroups, my output is called part-m-00000 and
> looks like
>
>
> SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@<8a>FA4ëÇ"Fª>þ^H^_-¯^@^@^WÇ^@^@^@^S^R/alt.atheism/49960x<9c><8d>Z]W"K²}¯_<91><87>
>
> etc. When that gets to the point of running rowid, I get
>
> 13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix with 19997 rows
> and 193659 columns to tmp/20news/int/matrix
>
> where each of those approximately 20,000 rows is plausibly a document in the
> 20news dataset.
>
> It seems then, to me, that lucene2seq is the culprit.


Yep, that looks to be the case.


> Maybe the best
> solution will be falling back on lucene.vector:
>
> ./mahout lucene.vector --dir <path to solr data>/index --output
> /tmp/lv-cvb/luceneout --field textbody_en --dictOut /tmp/lv-cvb/lucenedict
> --idField docid --norm 2 --weight TF --seqDictOut /tmp/lv-cvb/seqDictOut
> --norm 2 -x 70
>
> The output did look appropriately garbled.
>
> However, rowid doesn't like the output from lucene.vector:
> "java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable". Crossing my fingers and skipping
> rowid also failed, this time with LongWritable:
> "java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> cast to org.apache.hadoop.io.IntWritable."
>

That's very sad to see.  lucene.vector writes sequence files with
LongWritable keys, seq2sparse writes sequence files with Text keys, and
LDA wants input with IntWritable keys.  RowId solves only one of these
problems: it takes Text keys and turns them into IntWritable keys.

I would be very sad if it turns out your only current option is to write a
trivially changed version of RowId (it's a really simple job) which can
handle LongWritable keys as well as Text.  In fact, it would be a great
modification for that job to be changed to take *any* key type.  It
currently doesn't care what its keys are, so it should be pretty easy to
change all instances of "Text" in RowIdJob to "WritableComparable" (or
? extends WritableComparable) and it should "just work".  Lame!
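
To make the "any key type" idea concrete, here is a minimal sketch in plain
Java with no Hadoop dependency. It only illustrates the renumbering step:
each incoming row gets a sequential int id, and the original key is kept in a
side index so it can be looked up later. The class name RowIdSketch, the
Result holder and its docIndex and matrix fields are hypothetical names for
illustration, not Mahout's API; the real RowIdJob reads and writes Hadoop
sequence files, which this sketch replaces with in-memory maps.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free illustration of a key-agnostic "rowid" step.
public class RowIdSketch {

    // Holds the two outputs: rowId -> original key, and rowId -> vector.
    public static final class Result<K> {
        public final Map<Integer, K> docIndex = new LinkedHashMap<>();
        public final Map<Integer, double[]> matrix = new LinkedHashMap<>();
    }

    // Assigns sequential int ids in input order. The key type K is never
    // inspected, so String, Long, or anything else works unchanged.
    public static <K> Result<K> assignRowIds(List<Map.Entry<K, double[]>> rows) {
        Result<K> result = new Result<>();
        int rowId = 0;
        for (Map.Entry<K, double[]> row : rows) {
            result.docIndex.put(rowId, row.getKey());
            result.matrix.put(rowId, row.getValue());
            rowId++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Long keys stand in for the LongWritable keys mentioned above.
        List<Map.Entry<Long, double[]>> longKeyed = new ArrayList<>();
        longKeyed.add(new AbstractMap.SimpleEntry<>(1001L, new double[] {1.0, 0.0}));
        longKeyed.add(new AbstractMap.SimpleEntry<>(1002L, new double[] {0.0, 2.0}));

        Result<Long> result = assignRowIds(longKeyed);
        System.out.println("rows: " + result.matrix.size());
        System.out.println("docIndex: " + result.docIndex);
    }
}
```

The same method accepts String-keyed rows without any change, which is the
point of dropping the hard-coded key type.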


>
> My commands:
> ./mahout rowid -i /tmp/lv-cvb/luceneout  -o /tmp/lv-cvb/matrix
>
> ./mahout cvb -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/out -k 20 -x 10 -dict
> /tmp/lv-cvb/seqDictOut -dt /tmp/lv-cvb/topics -mt /tmp/lv-cvb/model
>
> Is there something I'm missing?
>
> Thank you,
> Liz
>
>
> On Thu, Jul 25, 2013 at 12:20 AM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
>
> > Liz,
> >
> > lucene2seq was a recent addition to Mahout 0.8 and it's good that you are
> > taking this for a test drive and reporting issues.
> > In order to troubleshoot this:
> >
> > a) Could you try running lucene2seq with the '-xm sequential' option and
> > verify the output?  The default is now MapReduce, and I am trying to
> > determine whether the issue is with the MapReduce version or something
> > more basic.
> > b) Is it possible for you to post your Solr index to these forums?  I can
> > take a stab at this to see what's wrong.
> >
> > Suneel
> >
> >
> >
> >
> > ________________________________
> >  From: Liz Merkhofer <lmerkhofer@bericotechnologies.com>
> > To: user@mahout.apache.org
> > Sent: Wednesday, July 24, 2013 5:07 PM
> > Subject: NaN in cvb topic models after lucene2seq
> >
> >
> > Hello list,
> >
> > I'm having some problems using cvb (now that lda is deprecated) on my
> > Lucene (or Solr, if you will) index. I am using Mahout 0.8.
> >
> > My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything seems
> > to be working, until all my topics come out, with seqdumper, as NaN, like:
> >
> > Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > org.apache.mahout.math.VectorWritable
> > Key: 0: Value:
> > {0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,
> >
> > ... etc. I suspect my problem is in the output of lucene2seq, which is a
> > folder of 14 files called /part-m-000xx that look very much like the
> > text in my Lucene index and nothing like the unreadable jumble I would
> > get from 'seqdirectory' on an actual directory of text files.
> >
> > If it helps, here's how I'm doing this:
> >
> > ./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr
> > data>index -id docId -f textbody_en
> >
> > ./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout
> > --namedVector --maxDFPercent 70 --weight TF -n 2 -a
> > org.apache.lucene.analysis.core.WhitespaceAnalyzer
> >
> > ./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout
> >
> > ./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30
> > -dict /tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt
> > /tmp/cvb/model
> >
> > Any thoughts?
> >
> > Thank you,
> > Liz
> >
>



-- 

  -jake
