mahout-user mailing list archives

From Arni Sumarlidason <Arni.Sumarlida...@mdaus.com>
Subject RE: Mahout: CVB: Error
Date Tue, 06 Nov 2012 14:39:53 GMT
Dan,

Thank you for your time, patience, and detailed response.

I have another question: I don’t understand the results I’m receiving :(

I’ve run this command:
./mahout cvb -i /user/root/sparse-vectors-cvb/matrix -o text_lda_sr -k 100 -x 1 -dict text_vec/dictionary.file-0 -dt text_cvb_document_sr -mt text_states_sr

Followed by:
./mahout vectordump -i /user/root/text_cvb_document_sr -d text_vec/dictionary.file-0 -dt sequencefile -o lda-cvb-topics.txt

I get a text file with term frequencies, but it contains one line per document I originally
created vectors from, not one line for each of the 100 topics. Am I doing something wrong?

Thank you for your help,


From: DAN HELM [mailto:danielhelm@verizon.net]
Sent: Sunday, November 04, 2012 6:43 PM
To: Arni Sumarlidason
Cc: user@mahout.apache.org
Subject: Re: Mahout: CVB: Error

Arni,

I had not formally contributed that code but it was posted before via email.

Here is an initial approach developed where rowid will output one "part" file for each input
"part" file processed:

http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8Rm4Co0mjR-BeWgecayV3mHBB+yeBQ_o9M+g@mail.gmail.com%3E

And this code will enable one to split the data up further via an optional "m" parameter that
specifies the maximum number of vectors to write to a part file:

http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821

These were just some quickly developed utilities written some months ago when working with
CVB. Obviously there are other ways to split the data up. You could also write software
to post-process rowid's matrix output file and split it up so that more mappers run.
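
[Editor's sketch] The splitting idea above (write at most "m" vectors to each part file) can be illustrated in Python. This is only a sketch of the chunking logic, not the utility linked above: it works on in-memory (row id, vector) pairs rather than Hadoop SequenceFiles, and the names split_into_parts and max_per_part are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# A record is an (integer row id, vector values) pair. Real rowid output is a
# Hadoop SequenceFile; plain tuples are an assumption made for illustration.
Record = Tuple[int, List[float]]


def split_into_parts(records: Iterable[Record],
                     max_per_part: int) -> Iterator[List[Record]]:
    """Yield consecutive groups of at most max_per_part records.

    Each yielded group stands in for one "part" file, so more groups
    would mean more input splits and therefore more mappers.
    """
    if max_per_part < 1:
        raise ValueError("max_per_part must be at least 1")
    iterator = iter(records)
    while True:
        part = list(islice(iterator, max_per_part))
        if not part:
            return
        yield part
```

For example, seven records split with max_per_part=3 would produce three groups holding 3, 3, and 1 records.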

Lately I have been doing more with the Mahout k-means algorithm since I wanted to be able
to cluster lots of documents in a timely manner.

As noted in the thread you posted below, the run time of LDA/CVB is very sensitive to
the size of the dictionary processed. This also affects mapper heap space requirements, since
each mapper needs to store (dictionary size * k * 8 * 2) bytes in memory. We also ran out of
mapper heap space before when "dictionary size" and/or "k" increased a lot, so we had to
reconfigure Hadoop for more mapper heap space (changed to 1 GB; no big deal to do).
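
[Editor's sketch] The heap estimate above (dictionary size * k * 8 * 2) is easy to evaluate before launching a job. The Python below is a minimal sketch of that arithmetic. Reading the 8 as bytes per value and the 2 as the number of model copies is an assumption, as is the function name; the email gives only the product.

```python
def cvb_mapper_model_bytes(dictionary_size: int,
                           num_topics: int,
                           bytes_per_value: int = 8,
                           copies: int = 2) -> int:
    """Estimate bytes each mapper holds for the model.

    Implements dictionary_size * k * 8 * 2 from the email. The meaning
    assigned to the 8 and the 2 (value width, number of copies) is an
    assumption, not something the email states.
    """
    return dictionary_size * num_topics * bytes_per_value * copies


def bytes_to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)
```

With a hypothetical dictionary of 500,000 terms and k = 100, cvb_mapper_model_bytes(500_000, 100) gives 800,000,000 bytes, roughly 763 MiB, which is already close to a 1 GB mapper heap.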

So yes, depending on how much data you are clustering and on the dictionary size, it could
take a long time to run.

Dan

From: Arni Sumarlidason <Arni.Sumarlidason@mdaus.com>
To: DAN HELM <danielhelm@verizon.net>
Cc: user@mahout.apache.org
Sent: Sunday, November 4, 2012 5:44 PM
Subject: Re: Mahout: CVB: Error

Dan,

Regarding this thread,
http://comments.gmane.org/gmane.comp.apache.mahout.user/13641

Did you publish your modification to the rowid function enabling the splitting of Matrix files?
A single pass on my data takes 9 hours. Does this sound reasonable to you? Please advise.

Best,

Arni

On Nov 3, 2012, at 8:38 PM, DAN HELM <danielhelm@verizon.net> wrote:


Arni,

I believe you are running with the wrong input for the cvb command: ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex
.....

It should be: ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix .....

docIndex is a file generated by rowid that maps the Integer keys assigned by rowid back to
the original sparse vector keys (in Text format). It contains no vectors, so cvb cannot use it as input.
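
[Editor's sketch] To make the docIndex/matrix distinction concrete, the Python below mimics what rowid does conceptually: it assigns consecutive integer keys to text-keyed vectors, keeps the vectors under the new keys (the "matrix"), and records which text key each integer stands for (the "docIndex"). It is an illustration only. The real tool reads and writes Hadoop SequenceFiles, and the function name and the dict-based outputs are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def assign_row_ids(
    vectors: Iterable[Tuple[str, List[float]]],
) -> Tuple[Dict[int, str], Dict[int, List[float]]]:
    """Re-key text-keyed vectors with consecutive integers.

    Returns (doc_index, matrix):
      doc_index maps integer key -> original text key (no vector data)
      matrix    maps integer key -> vector (this is what cvb consumes)
    """
    doc_index: Dict[int, str] = {}
    matrix: Dict[int, List[float]] = {}
    for row_id, (text_key, vector) in enumerate(vectors):
        doc_index[row_id] = text_key
        matrix[row_id] = vector
    return doc_index, matrix
```

Looking up a row id in doc_index recovers the original document name for a line of output, while only matrix carries the term counts that the topic model is trained on.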

Dan

From: Arni Sumarlidason <Arni.Sumarlidason@mdaus.com>
To: user@mahout.apache.org
Sent: Saturday, November 3, 2012 6:35 PM
Subject: Mahout: CVB: Error

Good evening, and thank you for reading. I am trying to run CVB on Mahout 0.8.

I have successfully executed the following steps:
./mahout seqdirectory --input /user/root/lda --output text_seq -c UTF-8 -ow -chunk 8
Resulting in 20 chunk files.

./mahout seq2sparse -i text_seq -o text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
Resulting in 109MB vector, "part-r-00000", "dictionary.file-0", and more.

./mahout rowid -i text_vec/tf-vectors -o sparse-vectors-cvb
Resulting in "docIndex" & "matrix"

Now... When attempting to run the following command,
./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex -o text_lda -k 100 -x 20 -dict text_vec/dictionary.file-0 -dt text_cvb_document -mt text_states
Resulting in an error: No part files found in model path 'text_states/model-1'

Can someone please point me in the right direction?

Best regards,

Arni




