mahout-user mailing list archives

From Andy Schlaikjer <andrew.schlaik...@gmail.com>
Subject Re: cvb/lda run time
Date Thu, 31 Jan 2013 02:49:43 GMT
I assume you mean an input *matrix* of 600,000 doc-term *vectors*.

You need to ensure these vectors are split evenly across many part files.
The number of part files will determine input splits and in turn map-side
parallelism.
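
As a rough sanity check on the figures quoted below (a 70MB file, 1MB blocks, around 600,000 documents), here is a small sketch that estimates the number of input splits and the documents per mapper. It assumes one map task per block-sized split and uniformly sized vectors, both simplifications; the function name is made up for illustration.

```python
import math

def estimate_splits(file_bytes, split_bytes, num_docs):
    """Estimate map task count and docs per task.

    Assumes one map task per split and uniformly sized documents;
    both are simplifying assumptions, not measured behavior.
    """
    num_splits = math.ceil(file_bytes / split_bytes)
    docs_per_split = num_docs / num_splits
    return num_splits, docs_per_split

MB = 1024 * 1024
splits, docs = estimate_splits(70 * MB, 1 * MB, 600_000)
print(splits, round(docs))
```

Under those assumptions the split count matches the 70 mappers reported, so each mapper would see only a few thousand short vectors.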

Could you let us know how much input each of your 70 mappers is processing?
Is there an imbalance?
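
One way to quantify an imbalance, assuming you can read the per-mapper input byte counts off the job counters: compare the largest mapper input to the mean. The helper name and the sample figures are hypothetical, not taken from this thread.

```python
def imbalance_ratio(bytes_per_mapper):
    """Return max/mean of per-mapper input sizes; 1.0 means perfectly even."""
    if not bytes_per_mapper:
        raise ValueError("need at least one mapper")
    mean = sum(bytes_per_mapper) / len(bytes_per_mapper)
    return max(bytes_per_mapper) / mean

# Hypothetical figures: 69 mappers with 1 MB each, one straggler with 4 MB.
sizes = [1_000_000] * 69 + [4_000_000]
print(round(imbalance_ratio(sizes), 2))
```

A ratio well above 1 would suggest that one or a few mappers hold most of the input and are setting the iteration time.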

Andy


On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <
davidlabarbera@localresponse.com> wrote:

> I ran cvb on AWS (mahout 0.7 and amazon's hadoop 1.0.3).
>
> I'm running it with
> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
> cvb \
> -i /lda/matrix-converted/matrix \
> -o s3n://.../lda/results \
> -dict /lda/dictionary.file-0 \
> -dt s3n://.../lda/doc-topics \
> -k 10 -x 10
>
> The dictionary has around 1,000,000 terms
> The input vector has around 600,000 documents (It's a 70MB file) with
> 10-100 terms in them.
> I created the matrix file with a block size of 1MB. Each iteration of
> CVB uses 70 mappers, and each mapper takes close to an hour to run.
>
> Is this expected performance under these conditions? Are there any
> parameters I can tune?
>
> David
