mahout-user mailing list archives

From Andy Schlaikjer <andrew.schlaik...@gmail.com>
Subject Re: Error: Java heap space on mahout cvb command
Date Fri, 25 May 2012 23:12:51 GMT
Hi Dan,

Each map task must have enough heap to hold a dense matrix of size
O(num topics * num terms). The size of the input documents shouldn't matter
unless you've got really huge (sparse) term vectors.
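
As a rough illustration of that bound, here is a minimal Python sketch (not
Mahout code) that estimates the size of one dense topics-by-terms matrix. It
assumes 8-byte double values and ignores JVM object overhead and any extra
copies a task might hold; the function names and the vocabulary sizes are
hypothetical and chosen only for illustration.

```python
def dense_model_bytes(num_topics: int, num_terms: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for one dense num_topics x num_terms matrix.

    bytes_per_value=8 assumes double-precision values (an assumption,
    not something confirmed in this thread).
    """
    return num_topics * num_terms * bytes_per_value


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


# Hypothetical vocabulary sizes; substitute the real dictionary size.
for num_terms in (50_000, 500_000):
    for num_topics in (40, 200):
        size = dense_model_bytes(num_topics, num_terms)
        print(f"k={num_topics:>3}  terms={num_terms:>7}  ~{to_mib(size):.1f} MiB")
```

The estimate grows linearly in both the number of topics and the vocabulary
size, and does not depend on the number of documents.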

What's the size of your input vocabulary?

Andy


On Fri, May 25, 2012 at 4:07 PM, DAN HELM <danielhelm@verizon.net> wrote:

> I’m running the new (since Mahout 0.6) CVB algorithm (LDA variation).
>
> Previously I successfully clustered the Reuters 21K collection.  For that
> case I ran the algorithm for 10 iterations into 60 clusters. Now I want to
> cluster a different 80K-file test collection.  Some of the documents are
> larger than the Reuters files, but most are not particularly large.
>
> When attempting to cluster that collection, I get a “Java heap space”
> error at the start of the first iteration of the “mahout cvb” run.  I wanted
> to run for 4 iterations and generate 200 clusters.
>
> The command I ran was:
>
> mahout cvb -i /tmp/sparse-vectors-cvb -o /tmp/cvb -k 200 -ow -x 4 -dt
> /tmp/doc-topic-cvb -dict /tmp/out-seqdir-sparse-cvb/dictionary.file-0 -mt
> /tmp/topicModelState
>
> Right before running that command I ran the following two commands to
> convert my sparse vectors (earlier steps not shown here) to the form
> needed by the cvb command:
>
> mahout rowid -i /tmp/out-seqdir-sparse-cvb/tf-vectors -o
> /tmp/sparse-vectors-cvb
>
> hadoop fs -mv /tmp/sparse-vectors-cvb/docIndex
> /tmp/sparse-vectors-index-cvb (note: this step was needed to move the
> generated docIndex file out so the cvb command would not blow up).
>
> The pertinent error log excerpt follows:
> ....
> ....
> 12/05/25 08:47.03 INFO cvb.CVB0Driver: Current iteration number: 0
> 12/05/25 08:47.03 INFO About to run iteration 1 of 4
> 12/05/25 08:47.03 INFO About to run: Iteration 1 of 4, input path:
> /tmp/topicModelState/model-0
> 12/05/25 08:47.03 INFO input.FileInputFormat: Total input paths to
> process: 1
> 12/05/25 08:47.03 INFO mapred.JobClient: Running job: job id
> 12/05/25 08:47.03 INFO mapred.JobClient: map 0% reduce 0%
> 12/05/25 08:47.03 INFO mapred.JobClient: Task Id : attempt id, Status :
> FAILED
> 12/05/25 08:47.03 Error: Java heap space
> ....
> ....
>
> I kept lowering the number of documents to be clustered until it
> finally worked when I had fewer than 10K files.  I also changed the number
> of clusters to generate (k) to 40 (I don't think this was an issue).  I am
> interested in being able to cluster very large sets with CVB (possibly
> hundreds of thousands of files or more), so I hope cvb can scale to that.
>
> I ran the above on a 3 node cluster.
>
> Thanks, Dan
