mahout-user mailing list archives

From Liz Merkhofer <lmerkho...@bericotechnologies.com>
Subject Re: NaN in cvb topic models after lucene2seq
Date Thu, 25 Jul 2013 20:28:01 GMT
Thanks for your help again. Logged a JIRA, then made a last-ditch effort to
check my commands and realized that the field I wanted to use for -id was
typed wrong (camel case instead of lower). Fixed it and was able to get the
appropriate number of rows in my matrix. Still waiting for cvb output from
that, but I'll wrap up this thread since the problem with lucene2seq boils
down to user error.

So the problem was on my end: what I entered as -id did not exist in my
Solr schema and so my documents were not delimited.
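
For anyone who hits the same thing: a case-sensitive check of the id field name against the schema's field list would catch this kind of typo before lucene2seq runs. Below is a minimal sketch in plain Java. `IdFieldCheck` and `check` are hypothetical names, not part of Mahout or Solr, and the field list is an assumption taken from the commands in this thread (`docid` from the lucene.vector command, `textbody_en` from both).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Mahout or Solr. */
class IdFieldCheck {

    /**
     * Returns an empty string when the requested field exists exactly as typed.
     * Otherwise returns a warning, with a hint when only the letter case differs.
     */
    static String check(String requested, List<String> schemaFields) {
        if (schemaFields.contains(requested)) {
            return "";
        }
        // Look for a field that matches apart from upper/lower case.
        Optional<String> nearMiss = schemaFields.stream()
                .filter(f -> f.equalsIgnoreCase(requested))
                .findFirst();
        return nearMiss
                .map(f -> "Field '" + requested + "' not found; did you mean '" + f + "'?")
                .orElse("Field '" + requested + "' not found in schema.");
    }

    public static void main(String[] args) {
        // Assumed schema fields, based on the commands quoted in this thread.
        List<String> fields = Arrays.asList("docid", "textbody_en");
        System.out.println(check("docId", fields));
    }
}
```

The field list would have to be supplied by hand or read from the Solr schema; the sketch only shows the comparison.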

Sorry for the false alarm; thanks for your helpfulness.



On Thu, Jul 25, 2013 at 12:46 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

> Agree with Jake that this is definitely an issue with lucene2seq.
>
> RowId should have created a matrix with 70000 rows (= no. of documents
> from your input corpus), but it seems like lucene2seq is creating one single
> document for all of them.
>
> Could you log a JIRA for this?
>
> Thanks again for reporting this.
>
>
>
> ________________________________
>  From: Jake Mannix <jake.mannix@gmail.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>
> Sent: Thursday, July 25, 2013 12:39 PM
> Subject: Re: NaN in cvb topic models after lucene2seq
>
>
> On Thu, Jul 25, 2013 at 9:07 AM, Liz Merkhofer <lmerkhofer@bericotechnologies.com> wrote:
>
> > Thanks so much for your response, Suneel.
> >
> > Unfortunately, the Solr index is not mine to post. But short of that, are
> > there any useful answers I can provide? At the time I ran this, it
> > contained 70,000 documents... I'm adding several times that today, though.
> >
> > I tried lucene2seq again.
> >
> > Running with the MapReduce default, the directory it creates contains
> > _SUCCESS part-m-00003 part-m-00007 part-m-00011
> > part-m-00000 part-m-00004 part-m-00008 part-m-00012
> > part-m-00001 part-m-00005 part-m-00009 part-m-00013
> > part-m-00002 part-m-00006 part-m-00010 part-m-00014
> >
> > With -xm sequential, however, it creates only "index."
> >
> > Looking at part-m-00014 or index, I see about the same thing: a header like
> >
> >
> >
> SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@ua<80>yäØQõ-ãe<93>n5<9d>¡^@^@^C)^@^@^@^A^@<8e>^C%(
> >
> > And then the concatenated text of (all?) my documents
> >
> >
> This is definitely the problem:
>
>
> > When I run "rowid," I get
> >
> > 13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix with 1 rows and
> > 465540 columns to /tmp/cvb/rowidout/matrix
> >
>
> >
> > In comparison, I'm working off the closest example I could find, from the
> > book Hadoop MapReduce Cookbook (page in Safari Books Online:
> > http://goo.gl/n3YVCz). Running seqdirectory on their sample, a directory
> > containing data from 20 newsgroups, my output is called part-m-00000 and
> > looks like
> >
> >
> >
> SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@<8a>FA4ëÇ"Fª>þ^H^_-¯^@^@^WÇ^@^@^@^S^R/alt.atheism/49960x<9c><8d>Z]W"K²}¯_<91><87>
> >
> > etc. When that gets to the point of running rowid, I get
> >
> > 13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix with 19997 rows
> > and 193659 columns to tmp/20news/int/matrix
> >
> > where those approx. 20,000 rows are plausibly each a document in the 20news
> > dataset.
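
The comparison above (1 row for a 70,000-document index versus 19997 rows for 20news) can be turned into a quick sanity check on the rowid log line. This is a minimal sketch in plain Java. `RowCountCheck` is a hypothetical name, the regular expression assumes the log wording shown in this thread, and the "fewer than half the expected documents" threshold is an arbitrary choice for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Mahout. */
class RowCountCheck {

    // Matches the RowIdJob log wording quoted in this thread.
    private static final Pattern ROWS =
            Pattern.compile("Wrote out matrix with (\\d+) rows and (\\d+) columns");

    /** Returns the row count found in the log line, or -1 if the line does not match. */
    static long parseRows(String logLine) {
        Matcher m = ROWS.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    /**
     * Flags a matrix whose row count is far below the expected document count.
     * The factor of 2 is an arbitrary threshold chosen for this sketch.
     */
    static boolean looksCollapsed(String logLine, long expectedDocs) {
        long rows = parseRows(logLine);
        return rows >= 0 && rows * 2 < expectedDocs;
    }

    public static void main(String[] args) {
        String solrRun = "13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix "
                + "with 1 rows and 465540 columns to /tmp/cvb/rowidout/matrix";
        String newsRun = "13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix "
                + "with 19997 rows and 193659 columns to tmp/20news/int/matrix";
        System.out.println(looksCollapsed(solrRun, 70000)); // prints "true"
        System.out.println(looksCollapsed(newsRun, 20000)); // prints "false"
    }
}
```

A row count of 1 means every document ended up in a single vector, which is the symptom reported here.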
> >
> > It seems then, to me, that lucene2seq is the culprit.
>
>
> Yep, that looks to be the case.
>
>
> > Maybe the best
> > solution will be falling back on lucene.vector:
> >
> > ./mahout lucene.vector --dir <path to solr data>/index --output
> > /tmp/lv-cvb/luceneout --field textbody_en --dictOut /tmp/lv-cvb/lucenedict
> > --idField docid --norm 2 --weight TF --seqDictOut /tmp/lv-cvb/seqDictOut
> > --norm 2 -x 70
> >
> > The output did look appropriately garbled.
> >
> > However, rowid doesn't like the output from lucene.vector,
> > "java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> > org.apache.hadoop.io.IntWritable" and crossing my fingers and skipping
> > rowid also had a problem with the LongWritable,
> > "java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> > cast to org.apache.hadoop.io.IntWritable."
> >
>
> That's very sad to see.  lucene.vector is spitting out sequence files which
> have keys being LongWritable, seq2sparse is spitting out sequence files
> which have Text keys, and LDA wants inputs which are IntWritable keys.
> RowId alleviates only one problem: taking Text keys and turning them into
> IntWritable keys.
>
> I would be very sad if it turns out your only current option is to write a
> trivially changed version of RowId (it's a really simple job) which can
> handle LongWritable keys as well as Text.  In fact, it would be a great
> modification for that job to be changed to take *any* key type.  It
> currently doesn't care what its keys are, so it should be pretty easy to
> change all instances of "Text" in RowIdJob to "WritableComparable" (or
> ? extends WritableComparable) and it should "just work".  Lame!
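
To make the "take any key type" idea concrete, here is a minimal sketch in plain Java, without Hadoop types. It is not the actual RowIdJob code. `GenericRowIds`, `Result`, and `assignRowIds` are hypothetical names, and the two maps only stand in for a re-keyed matrix and an index from row id back to the original key. The point is that the renumbering never inspects the key, so a generic type parameter is enough.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of key-agnostic row numbering; not Mahout code. */
class GenericRowIds {

    /** Rows re-keyed 0..n-1, plus an index from each row id back to its original key. */
    static final class Result<K, V> {
        final Map<Integer, V> matrix = new LinkedHashMap<>();
        final Map<Integer, K> docIndex = new LinkedHashMap<>();
    }

    /** Assigns consecutive int ids in input order; the key type K is never inspected. */
    static <K, V> Result<K, V> assignRowIds(Iterable<Map.Entry<K, V>> rows) {
        Result<K, V> out = new Result<>();
        int next = 0;
        for (Map.Entry<K, V> row : rows) {
            out.matrix.put(next, row.getValue());
            out.docIndex.put(next, row.getKey());
            next++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Long keys stand in for LongWritable; String keys would work the same way.
        List<Map.Entry<Long, double[]>> longKeyed = new ArrayList<>();
        longKeyed.add(new AbstractMap.SimpleEntry<>(1001L, new double[] {1.0, 0.0}));
        longKeyed.add(new AbstractMap.SimpleEntry<>(1002L, new double[] {0.0, 2.0}));

        Result<Long, double[]> result = assignRowIds(longKeyed);
        System.out.println(result.docIndex); // prints "{0=1001, 1=1002}"
    }
}
```

In a real Hadoop job the generic key would still have to be serialized into whatever type the index file uses, which this sketch leaves out.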
>
>
> >
> > My commands:
> > ./mahout rowid -i /tmp/lv-cvb/luceneout  -o /tmp/lv-cvb/matrix
> >
> > ./mahout cvb -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/out -k 20 -x 10 -dict
> > /tmp/lv-cvb/seqDictOut -dt /tmp/lv-cvb/topics -mt /tmp/lv-cvb/model
> >
> > Is there something I'm missing?
> >
> > Thank you,
> > Liz
> >
> >
> > On Thu, Jul 25, 2013 at 12:20 AM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
> >
> > > Liz,
> > >
> > > lucene2seq was a recent addition to Mahout 0.8 and it's good that you are
> > > taking this for a test drive and reporting issues.
> > > In order to troubleshoot this:
> > >
> > > a) Could you try running lucene2seq with a '-xm sequential' option and
> > > verify the output?  The default option now is MapReduce and I am trying to
> > > determine if the issue could be with the MapReduce version or if it's
> > > something more basic.
> > > b) Is it possible for you to post your Solr index to these forums? I can
> > > take a stab at this to see what's wrong.
> > >
> > > Suneel
> > >
> > >
> > >
> > >
> > > ________________________________
> > >  From: Liz Merkhofer <lmerkhofer@bericotechnologies.com>
> > > To: user@mahout.apache.org
> > > Sent: Wednesday, July 24, 2013 5:07 PM
> > > Subject: NaN in cvb topic models after lucene2seq
> > >
> > >
> > > Hello list,
> > >
> > > I'm having some problems using cvb (now that lda is deprecated) on my
> > > Lucene (or Solr, if you will) index. I am using Mahout 0.8.
> > >
> > > My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything seems to
> > > be working, until all my topics come out, with seqdumper, as NaN, like:
> > >
> > > Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > > org.apache.mahout.math.VectorWritable
> > > Key: 0: Value:
> > >
> > >
> >
> {0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,
> > >
> > > ... etc. I suspect my problem is in the output of lucene2seq, which is a
> > > folder of 14 files called /part-m-000xx that look very much like the
> > > text in my Lucene index and nothing like the unreadable jumble I would get
> > > from 'seqdirectory' on an actual directory of text files.
> > >
> > > If it helps, here's how I'm doing this:
> > >
> > > ./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr
> > > data>index -id docId -f textbody_en
> > >
> > > ./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout
> > > --namedVector --maxDFPercent 70 --weight TF -n 2 -a
> > > org.apache.lucene.analysis.core.WhitespaceAnalyzer
> > >
> > > ./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout
> > >
> > > ./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict
> > > /tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt
> > > /tmp/cvb/model
> > >
> > > Any thoughts?
> > >
> > > Thank you,
> > > Liz
> > >
> >
>
>
>
> --
>
>   -jake
>
