mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Command line : Error using clusterdump after cvb (0.7)
Date Thu, 15 Nov 2012 19:02:26 GMT
On Thu, Nov 15, 2012 at 3:20 AM, Jérémie Gomez <jeremie.gomez@gmail.com> wrote:

> Thanks a lot Jake,
>
> I have tried using the vectordump job to retrieve the topics in text
> format, and obtained a text document stating all the terms in the
> dictionary file and numerical values, which I could not successfully
> interpret. My commands were the following:
>
> 1. bin/mahout cvb -i inputdir/matrix -o cvboutput -k 20 -x 10 -dict
> seq2sparseoutput/dictionary.file-0 -dt topicdistrib -mt temp/model-1
>
> 2. bin/mahout vectordump -i cvboutput -o termtopics -d
> seq2sparseoutput/dictionary.file-0 --dictionaryType sequencefile
> --vectorSize 5
>
>
> I'm guessing this might be due to the missing "-sort" option,


Yeah, you won't be able to interpret the output *at all* without -sort: you'll
just get the first few terms for each topic, in no particular order (i.e. possibly
ones which are not at all likely in that topic, but merely have probability > 0).
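
To illustrate why the unsorted dump is uninterpretable, here is a minimal Python
sketch. It is not Mahout code: the `topic` weights and the helper names
`first_n` and `top_n` are made up for illustration. It only contrasts taking the
first few entries as stored with taking the highest-weight entries.

```python
# Hypothetical topic row: term -> weight, stored in dictionary order
# (not ordered by weight). All values are invented for illustration.
topic = {
    "aardvark": 0.0001,
    "bank": 0.21,
    "cat": 0.0002,
    "loan": 0.34,
    "money": 0.27,
    "zebra": 0.0003,
}


def first_n(weights, n):
    """Roughly what an unsorted dump shows: the first n entries as stored."""
    return list(weights.items())[:n]


def top_n(weights, n):
    """Roughly what a sorted dump shows: the n highest-weight terms."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]


print("unsorted:", first_n(topic, 3))
print("sorted:  ", top_n(topic, 3))
```

In the unsorted view, two of the three terms shown have negligible weight, which
is why the topic looks meaningless.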

Another thing: you're using temp/model-1 - sounds like you're looking
at your *first* iteration of the output?  That's nowhere near convergence,
and your topics will look like garbage - you need to take at least iteration
10 or 20 to see some good topics.
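
A small Python sketch of picking the most recent iteration, assuming the model
directory contains one `model-N` subdirectory per iteration (as `temp/model-1`
suggests). The helper name `latest_model_dir` and the example listing are
hypothetical. The number is parsed explicitly because a plain string sort would
put `model-10` before `model-2`.

```python
import re


def latest_model_dir(names):
    """Return the 'model-N' name with the largest N, or None if there is none."""
    best, best_n = None, -1
    for name in names:
        match = re.fullmatch(r"model-(\d+)", name)
        if match and int(match.group(1)) > best_n:
            best, best_n = name, int(match.group(1))
    return best


# Hypothetical directory listing; in practice this would come from listing
# the model directory on the file system in use.
listing = ["model-1", "model-2", "model-10", "_logs"]
print(latest_model_dir(listing))
```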

> but I can't
> use the -sort option because of a heap memory problem that I can't fix by
> changing the MAHOUT_HEAPSIZE variable, and I get that heap memory problem
> even though I am running the cvb test on a 1.3 MB dataset...
>

So are you running on trunk?  I think -sort was broken in the last release,
but has been fixed for a few months now on subversion trunk.


>
> Thank you !
>
>
> 2012/11/14 Jake Mannix <jake.mannix@gmail.com>
>
> > Clusterdump doesn't work on LDA output, as LDA doesn't produce "cluster"
> > objects.
> >
> > If you want to look at the topics for CVB, use vectordump:
> >
> >
> > mahout vectordump -s <path to topics sequence file> --dictionary <path to
> > dictionary.file-0> --dictionaryType seqfile --vectorSize <num entries
> > per topic you want to see> -sort
> >
> >
> >
> > On Wed, Nov 14, 2012 at 10:22 AM, Jérémie Gomez <jeremie.gomez@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I have tried several of the clustering algorithms in mahout and they worked
> > > great, but I have a problem with the cvb implementation of Latent Dirichlet
> > > Allocation. The cvb command works fine but then using clusterdump gives me
> > > the following error:
> > >
> > > Exception in thread "main" java.lang.ClassCastException:
> > > org.apache.mahout.math.VectorWritable cannot be cast to
> > > org.apache.mahout.clustering.iterator.ClusterWritable
> > >
> > > What I do in detail:
> > > 1) mahout seqdirectory -c UTF-8 -i inputdir -o sequencefiles
> > > 2) mahout seq2sparse -i sequencefiles -o sparsevectors -ow -a
> > > org.apache.lucene.analysis.WhitespaceAnalyzer -x 99 -wt tfidf -s 5 -md 1 -x
> > > 90 -ng 2 -ml 50 -seq -n 2
> > > 3) mahout rowid -i sparsevectors/tf-vectors -o rowidresult
> > > 4) mahout cvb -i rowresult/matrix -dict
> > > sparsevectors/dictionary.file-0 -o topics -dt documents -mt states -ow -k
> > > 10
> > > 5) mahout clusterdump -i topics -o clusters -of TEXT -n 10 -d
> > > marcelproust/dictionary.file-0 -dt sequencefile
> > >
> > > When I run command 5, I get the error above. Unfortunately, I could not
> > > find any working solution after searching the archives, so I thought I'd ask
> > > the community!
> > >
> > > Thanks a lot in advance.
> > > Jeremie
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>



-- 

  -jake
