mahout-user mailing list archives

From David LaBarbera
Subject cvb/lda run time
Date Thu, 31 Jan 2013 02:06:09 GMT
I ran CVB on AWS (Mahout 0.7 and Amazon's Hadoop 1.0.3).

I'm running it with:
hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
cvb \
-i /lda/matrix-converted/matrix \
-o s3n://.../lda/results \
-dict /lda/dictionary.file-0 \
-dt s3n://.../lda/doc-topics \
-k 10 -x 10

The dictionary has around 1,000,000 terms.
The input has around 600,000 documents (it's a 70MB file), each with 10-100 terms.

I created the matrix file with a block size of 1MB. Each iteration of CVB uses 70
mappers, and each mapper takes close to an hour to run.
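
For reference, here is a rough sketch of the arithmetic behind those numbers. It assumes one mapper per 1MB block, and (an assumption on my part, not something I have verified in the Mahout source) that the topic model is held as a dense topics x terms array of 8-byte doubles.

```python
import math

# Figures quoted in this message.
file_size_mb = 70
block_size_mb = 1
num_documents = 600_000
num_terms = 1_000_000
num_topics = 10  # from -k 10

# Assumption: one input split (and so one mapper) per block.
num_mappers = math.ceil(file_size_mb / block_size_mb)
docs_per_mapper = num_documents / num_mappers

# Assumption: dense topics x terms model, 8 bytes per entry.
model_bytes = num_topics * num_terms * 8

print(f"mappers per iteration: {num_mappers}")
print(f"documents per mapper: {docs_per_mapper:.0f}")
print(f"dense model size: {model_bytes / 1e6:.0f} MB")
```

So each mapper sees only a few thousand documents, while under that assumption the model itself would be larger than the whole input file.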

Is this expected performance under these conditions? Are there any parameters I can tune?
