On Fri, May 25, 2012 at 4:25 PM, DAN HELM <danielhelm@verizon.net> wrote:
> Hi Andy,
>
> I ran this at work so don't have the data and log now but somehow I seem
> to recall log output (after the rowid step) saying there were around 90K
> terms/columns in the resulting matrix...but I would have to check next week.
>
> So, I guess the key is to jack up the map task heap space to support a
> dense matrix? So per your O(num topics * num terms) below, I guess k
> (# topics) could also have been a culprit, in particular when I had k=200.
>
The total heap size will need to be about (8 bytes * numTopics * numTerms *
2), plus some for the rest of the MapReduce machinery, stack, etc. So in your
case, that's about 288MB without any object overhead or the MapReduce bits,
so if you have a 768MB heap per mapper, you should be safe.
I've got a branch out which adds a bit of sparsification to this process,
but it's not fully baked yet. We run into this too, even with 3GB heaps,
since we run with 200k+ terms and 200-500 topics. The plan is to scale to
tens of millions of terms and thousands of more-sparse topics, but we
haven't quite gotten there yet. :)
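If the estimate comes out near (or above) your current mapper heap, one way to raise it is to pass the child JVM opts through as a Hadoop generic option; Mahout's drivers are ToolRunner-based, so a `-D` flag after the command name should be picked up. This is a sketch, not a verified recipe: the property name below is the Hadoop 1.x-era one, and the `-Xmx` value is just an illustration.

```shell
# Give each map task a larger heap before re-running cvb.
# mapred.child.java.opts applies to map and reduce child JVMs on
# Hadoop 1.x; newer versions use mapreduce.map.java.opts instead.
mahout cvb -Dmapred.child.java.opts=-Xmx1024m \
  -i /tmp/sparsevectorscvb -o /tmp/cvb -k 200 -ow -x 4 \
  -dt /tmp/doctopiccvb -dict /tmp/outseqdirsparsecvb/dictionary.file0 \
  -mt /tmp/topicModelState
```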
> Out of curiosity, if one were to cluster 1 million documents, what would
> be a reasonable k? I guess it depends to the nature of the data (domain)
> and application but it would seem if k is too small then the clusters would
> be way too fat and noisy.
>
For your data set, I'd guess that 100-300 is totally reasonable. It
depends more on your numTerms than your numDocs, actually. You could have
100M docs but still require the same k, because it's the terms which are
clustering, really.
>
> Thanks.
>
>
> ________________________________
> From: Andy Schlaikjer <andrew.schlaikjer@gmail.com>
> To: user@mahout.apache.org; DAN HELM <danielhelm@verizon.net>
> Sent: Friday, May 25, 2012 7:12 PM
> Subject: Re: Error: Java heap space on mahout cvb command
>
> Hi Dan,
>
> Each map task must have enough heap to store a dense matrix O(num topics *
> num terms). Size of input documents shouldn't matter unless you've got
> really huge (sparse) term vectors.
>
> What's the size of your input vocabulary?
>
> Andy
>
>
> On Fri, May 25, 2012 at 4:07 PM, DAN HELM <danielhelm@verizon.net> wrote:
>
> > I’m running the new (since Mahout 0.6) CVB algorithm (LDA variation).
> >
> > Previously I successfully clustered the Reuters 21K collection. For that
> > case I ran the algorithm for 10 iterations into 60 clusters. Now I want
> > to cluster a different 80K file test collection. Some of the documents
> > are larger than the Reuters files, but most are not particularly large.
> >
> > When attempting to cluster that collection, I get a “Java heap space”
> > error at the start of the first iteration of the “mahout cvb” run. I
> > wanted to run for 4 iterations and generate 200 clusters.
> >
> > The command I ran was:
> >
> > mahout cvb -i /tmp/sparsevectorscvb -o /tmp/cvb -k 200 -ow -x 4 -dt
> > /tmp/doctopiccvb -dict /tmp/outseqdirsparsecvb/dictionary.file0 -mt
> > /tmp/topicModelState
> >
> > Right before running that command, I ran the following two commands to
> > convert my sparse vectors (earlier steps not shown here) to the proper
> > form needed for the cvb command:
> >
> > mahout rowid -i /tmp/outseqdirsparsecvb/tfvectors -o
> > /tmp/sparsevectorscvb
> >
> > hadoop fs -mv /tmp/sparsevectorscvb/docIndex
> > /tmp/sparsevectorsindexcvb (note: this step was needed to move the
> > generated docIndex file out so the cvb command would not blow up).
> >
> > The pertinent error log excerpt follows:
> > ....
> > ....
> > 12/05/25 08:47.03 INFO cvb.CVB0Driver: Current iteration number: 0
> > 12/05/25 08:47.03 INFO About to run iteration 1 of 4
> > 12/05/25 08:47.03 INFO About to run: Iteration 1 of 4, input path:
> > /tmp/topicModelState/model0
> > 12/05/25 08:47.03 INFO input.FileInputFormat: Total input paths to
> > process: 1
> > 12/05/25 08:47.03 INFO mapred.JobClient: Running job: job id
> > 12/05/25 08:47.03 INFO mapred.JobClient: map 0% reduce 0%
> > 12/05/25 08:47.03 INFO mapred.JobClient: Task Id : attempt id, Status :
> > FAILED
> > 12/05/25 08:47.03 Error: Java heap space
> > ....
> > ....
> >
> > I kept lowering the number of documents to be clustered until it
> > finally worked when I had fewer than 10K files. I also changed the
> > number of clusters to generate (k) to 40 (I don't think this was an
> > issue). I am interested in being able to cluster very large sets with
> > CVB (possibly hundreds of thousands of files or more), so I hope cvb
> > can scale to that.
> >
> > I ran the above on a 3 node cluster.
> >
> > Thanks, Dan
>

jake
