mahout-user mailing list archives

From DAN HELM <danielh...@verizon.net>
Subject Re: Mahout: CVB: Error
Date Sun, 04 Nov 2012 23:42:55 GMT
Arni,
 
I had not formally contributed that code but it was posted before via email.
 
Here is an initial approach developed where rowid will output one "part" file for each input
"part" file processed:
 
http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8Rm4Co0mjR-BeWgecayV3mHBB+yeBQ_o9M+g@mail.gmail.com%3E
 
And this code lets you split the data up further via an optional "m" parameter that
specifies the maximum number of vectors to write to each part file:
 
http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821
 
These were just some quickly developed utilities written some months ago when working with
CVB.   Obviously there are other ways to split the data up.  You could also write software
to post-process rowid's Matrix output file and split it up so more mappers run.
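 
As a rough illustration of the "maximum vectors per part file" idea, here is a minimal Python sketch. It is not the code from the messages linked above, and it does not read or write Hadoop SequenceFiles: the function name, the part-file naming pattern, and the in-memory (row id, vector) pairs are all assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

Row = TypeVar("Row")


def split_into_parts(
    rows: Iterable[Row], max_per_part: int
) -> Iterator[Tuple[str, List[Row]]]:
    """Yield (part_name, rows) groups holding at most max_per_part rows each.

    The "part-m-00000" naming is a hypothetical stand-in for whatever names
    a real job would write; nothing here touches HDFS.
    """
    if max_per_part < 1:
        raise ValueError("max_per_part must be at least 1")
    iterator = iter(rows)
    part_number = 0
    while True:
        # Pull the next group of rows; an empty group means input is exhausted.
        chunk = list(islice(iterator, max_per_part))
        if not chunk:
            return
        yield f"part-m-{part_number:05d}", chunk
        part_number += 1


# Toy (row id, sparse vector) pairs standing in for rowid's output.
example_rows = [(i, {i: 1.0}) for i in range(5)]
parts = list(split_into_parts(example_rows, max_per_part=2))
```

With more, smaller part files, Hadoop generally has more input splits to hand out, which is how splitting gets more mappers running.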
 
Lately I have been doing more with the Mahout k-means algorithm since I wanted to be able
to cluster lots of documents in a timely manner.
 
As noted in the thread you linked below, the run time of LDA/CVB is very sensitive
to the size of the dictionary.  Dictionary size also drives mapper heap requirements:
each mapper needs to hold (dictionary size * k * 8 * 2) in memory.  We previously ran
out of mapper heap space when "dictionary size" and/or "k" grew large, so we had to
reconfigure Hadoop to give mappers more heap (changed to 1 GB; no big deal to do).
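 
To make the (dictionary size * k * 8 * 2) estimate concrete, here is a small Python sketch. Reading the 8 as bytes per double and the 2 as two copies of the model is an assumption, as is treating the result as a byte count; the dictionary size and k below are hypothetical values, not figures from this thread.

```python
BYTES_PER_DOUBLE = 8  # assumption: the "8" in the formula is the size of a double
MODEL_COPIES = 2      # assumption: the "2" means two copies of the model are held


def cvb_mapper_model_bytes(dictionary_size: int, num_topics: int) -> int:
    """Approximate model memory per mapper: dictionary size * k * 8 * 2."""
    return dictionary_size * num_topics * BYTES_PER_DOUBLE * MODEL_COPIES


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


# Hypothetical corpus: 500,000 dictionary terms and k = 100 topics.
estimate = cvb_mapper_model_bytes(500_000, 100)
print(f"{estimate} bytes, about {to_mib(estimate):.0f} MiB")
```

Under these assumptions the estimate is 800,000,000 bytes (roughly 763 MiB) for the model alone, which shows how quickly a large dictionary or a large k can exhaust a small default mapper heap.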
 
So yes, depending on how much data you are clustering and the dictionary size, it could
take a long time to run.
 
Dan
 

________________________________
 From: Arni Sumarlidason <Arni.Sumarlidason@mdaus.com>
To: DAN HELM <danielhelm@verizon.net> 
Cc: "user@mahout.apache.org" <user@mahout.apache.org> 
Sent: Sunday, November 4, 2012 5:44 PM
Subject: Re: Mahout: CVB: Error
  

Dan, 
 
Regarding this thread, 
http://comments.gmane.org/gmane.comp.apache.mahout.user/13641 
 
Did you publish your modification to the rowid function enabling the splitting of Matrix files?
A single pass on my data takes 9 hours. Does this sound reasonable to you? Please advise.

 
Best, 
 
Arni 


On Nov 3, 2012, at 8:38 PM, DAN HELM <danielhelm@verizon.net> wrote: 

Arni, 
>  
>I believe you are running the cvb command with the wrong input: ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex .....
>  
>It should be: ./mahout cvb -i /user/root/sparse-vectors-cvb/Matrix ..... 
>  
>docIndex is a file generated by rowid that maps the original sparse vector keys (in Text format) to the Integer keys assigned by rowid.
>  
>Dan
>
>________________________________
> From: Arni Sumarlidason <Arni.Sumarlidason@mdaus.com>
>To: "user@mahout.apache.org" <user@mahout.apache.org> 
>Sent: Saturday, November 3, 2012 6:35 PM
>Subject: Mahout: CVB: Error
> 
>Good evening, and thank you for reading. I am trying to run CVB on Mahout 0.8.
>
>I have successfully executed the following steps:
>./mahout seqdirectory --input /user/root/lda --output text_seq -c UTF-8 -ow -chunk 8
>Resulting in 20 chunk files.
>
>./mahout seq2sparse -i text_seq -o text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>Resulting in 109MB vector, "part-r-00000", "dictionary.file-0", and more.
>
>./mahout rowid -i text_vec/tf-vectors -o sparse-vectors-cvb
>Resulting in "docIndex" & "matrix"
>
>Now, when attempting to run the following command,
>./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex -o text_lda -k 100 -x 20 -dict text_vec/dictionary.file-0 -dt text_cvb_document -mt text_states
>Resulting in an error: No part files found in model path 'text_states/model-1'
>
>Can someone please point me in the right direction?
>
>Best regards,
>
>Arni
>
>
> 