mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Mahout: CVB: Error
Date Tue, 06 Nov 2012 15:38:02 GMT
On Tue, Nov 6, 2012 at 6:39 AM, Arni Sumarlidason <
Arni.Sumarlidason@mdaus.com> wrote:

> Dan,
>
> Thank you for your time, patience, and detailed response.
>
> Another question: I don’t understand the results I’m receiving :(
>
> I’ve run this command: ./mahout cvb -i
> /user/root/sparse-vectors-cvb/matrix -o text_lda_sr -k 100 -x 1 -dict
> text_vec/dictionary.file-0 -dt text_cvb_document_sr -mt text_states_sr
> Followed by: ./mahout vectordump -i /user/root/text_cvb_document_sr -d
> text_vec/dictionary.file-0 -dt sequencefile -o lda-cvb-topics.txt
>
> I get a text file with term frequencies, but it has one line per document I
> originally created vectors from, not one line for each of the 100 topics.
> Am I doing something wrong?
>

./mahout vectordump

wants to take in vector files: you can give it the text inputs you started
with (text_cvb_document_sr, in your case), and you'll just see the
"bag-of-words" representation of your input docs.  If you give it one of
the "model" files (in text_lda_sr), then you'll get the term distributions
for the topics.
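
For example, a minimal sketch that reuses the flags from your vectordump
command and changes only the input (the -o directory from your cvb run) and
the output file. The output file name lda-cvb-topic-terms.txt and the MAHOUT
dry-run variable are just placeholders for illustration, and depending on
your Mahout version the model part files may sit in a subdirectory, so it is
worth listing the directory first:

```shell
# Dry run by default: prints the command instead of running it.
# Set MAHOUT=./mahout to execute it for real (MAHOUT is a hypothetical helper variable).
MAHOUT="${MAHOUT:-echo ./mahout}"

dump_topic_terms() {
  # Same flags as the original vectordump call; only -i and -o differ.
  # -i points at the cvb -o directory (the topic/term model),
  # not at the cvb doc/topic output directory.
  $MAHOUT vectordump \
    -i /user/root/text_lda_sr \
    -d text_vec/dictionary.file-0 \
    -dt sequencefile \
    -o lda-cvb-topic-terms.txt
}

dump_topic_terms
```

With k = 100 you should then see one line per topic rather than one line per
document.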




>
> Thank you for your help,
>
>
> From: DAN HELM [mailto:danielhelm@verizon.net]
> Sent: Sunday, November 04, 2012 6:43 PM
> To: Arni Sumarlidason
> Cc: user@mahout.apache.org
> Subject: Re: Mahout: CVB: Error
>
> Arni,
>
> I had not formally contributed that code but it was posted before via
> email.
>
> Here is an initial approach developed where rowid will output one "part"
> file for each input "part" file processed:
>
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8Rm4Co0mjR-BeWgecayV3mHBB+yeBQ_o9M+g@mail.gmail.com%3E
>
> And this code will enable one to split the data up more via an optional "m"
> parameter that enables one to specify how many vectors (max) to write to a
> part file:
>
> http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821
>
> These were just some quickly developed utilities written some months ago
> when working with CVB.   Obviously there are other ways to split the data
> up.  You could also write software to post-process rowid's Matrix output
> file and split it up so more mappers run.
>
> Lately I have been doing more with the Mahout k-means algorithm since I
> wanted to be able to cluster lots of documents in a timely manner.
>
> As specified in the thread you posted below, the run-time of LDA/CVB is
> very susceptible to the size of the dictionary processed.  This also
> affects mapper heap space requirements, since each mapper needs to store
> (dictionary size * k * 8 * 2) bytes in memory.  We also ran into trouble
> before with running out of mapper heap space when "dictionary size" and/or
> "k" increased a lot, so we had to reconfigure Hadoop for more mapper heap
> space (changed to 1 GB; no big deal to do).
>
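
A rough way to apply the heap figure quoted above before launching a run is
sketched here. It assumes the 8 is the size of a double in bytes and the 2
stands for two copies of the model held per mapper; the function name and the
example dictionary size are hypothetical, not taken from this thread.

```shell
# Estimate the per-mapper model footprint in bytes from the quoted formula:
#   dictionary size * k * 8 * 2
# Assumptions: 8 = bytes per double, 2 = two copies of the model in memory.
cvb_model_bytes() {
  dict_size="$1"    # number of terms in the dictionary
  num_topics="$2"   # the -k value passed to cvb
  echo $(( dict_size * num_topics * 8 * 2 ))
}

# Hypothetical example: 500,000 terms and k = 100 topics.
bytes=$(cvb_model_bytes 500000 100)
echo "model needs about $(( bytes / 1024 / 1024 )) MiB per mapper"
```

If the estimate comes close to the configured mapper heap, either the heap has
to grow or the dictionary or k has to shrink.
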
> So yes depending on how much data you are clustering and dictionary size,
> it could take a long time to run.
>
> Dan
>
> From: Arni Sumarlidason <Arni.Sumarlidason@mdaus.com>
> To: DAN HELM <danielhelm@verizon.net>
> Cc: user@mahout.apache.org
> Sent: Sunday, November 4, 2012 5:44 PM
> Subject: Re: Mahout: CVB: Error
>
> Dan,
>
> Regarding this thread,
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13641
>
> Did you publish your modification to the rowid function enabling the
> splitting of Matrix files? A single pass on my data takes 9 hours. Does
> this sound reasonable to you? Please advise.
>
> Best,
>
> Arni
>
> On Nov 3, 2012, at 8:38 PM, DAN HELM <danielhelm@verizon.net> wrote:
>
>
> Arni,
>
> I believe you are running with the wrong input for the cvb command:
> ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex .....
>
> It should be: ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix .....
>
> docIndex is a file generated by rowid that provides a mapping between the
> original sparse vector keys (in Text format) to the Integer keys assigned
> by rowid.
>
> Dan
>
> From: Arni Sumarlidason <Arni.Sumarlidason@mdaus.com>
> To: user@mahout.apache.org
> Sent: Saturday, November 3, 2012 6:35 PM
> Subject: Mahout: CVB: Error
>
> Good evening, and thank you for reading. I am trying to run CVB on Mahout
> 0.8.
>
> I have successfully executed the following steps:
> ./mahout seqdirectory --input /user/root/lda --output text_seq -c UTF-8
> -ow -chunk 8
> Resulting in 20 chunk files.
>
> ./mahout seq2sparse -i text_seq -o text_vec -wt tf -a
> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> Resulting in 109MB vector, "part-r-00000", "dictionary.file-0", and more.
>
> ./mahout rowid -i text_vec/tf-vectors -o sparse-vectors-cvb
> Resulting in "docIndex" & "matrix"
>
> Now... When attempting to run the following command,
> ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex -o text_lda -k 100
> -x 20 -dict text_vec/dictionary.file-0 -dt text_cvb_document -mt text_states
> Resulting in an error: No part files found in model path
> 'text_states/model-1'
>
> Can someone please point me in the right direction?
>
> Best regards,
>
> Arni
>


-- 

  -jake
