mahout-user mailing list archives

From Jack Pay <jp...@sussex.ac.uk>
Subject Re: cvb/lda run time
Date Thu, 31 Jan 2013 08:19:14 GMT
On a related note, I believe I have found a bug in the cvb implementation.
How do I go about getting it fixed?

Sent from my iPad

On 31 Jan 2013, at 02:50, "Andy Schlaikjer" <andrew.schlaikjer@gmail.com> wrote:

> I assume you mean input *matrix* with 600,000 doc-term *vectors*.
> 
> You need to ensure these vectors are split evenly across many part files.
> The number of part files will determine input splits and in turn map-side
> parallelism.
> 
> Could you let us know how much input each of your 70 mappers is processing?
> Is there an imbalance?
> 
> Andy
> 
> 
> On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <
> davidlabarbera@localresponse.com> wrote:
> 
>> I ran cvb on AWS (Mahout 0.7 and Amazon's Hadoop 1.0.3).
>> 
>> I'm running it with
>> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
>> cvb \
>> -i /lda/matrix-converted/matrix \
>> -o s3n://.../lda/results \
>> -dict /lda/dictionary.file-0 \
>> -dt s3n://.../lda/doc-topics \
>> -k 10 -x 10
>> 
>> The dictionary has around 1,000,000 terms.
>> The input vector has around 600,000 documents (it's a 70MB file) with
>> 10-100 terms in each.
>> I created the matrix file with a block size of 1MB. Each iteration of
>> CVB uses 70 mappers, and each mapper takes close to an hour to run.
>> 
>> Is this expected performance under these conditions? Are there any
>> parameters I can tune?
>> 
>> David
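
For reference, the mapper count in the quoted message follows from the file and block sizes given there. Below is a minimal sketch of that arithmetic, assuming one input split per block and documents spread evenly across splits. The function names are hypothetical helpers, not part of Mahout or Hadoop, and the per-mapper counts passed to imbalance_ratio would have to come from the job's own counters.

```python
import math


def estimate_splits(file_size_mb, block_size_mb):
    """Estimate the number of input splits, assuming one split per block."""
    return math.ceil(file_size_mb / block_size_mb)


def docs_per_mapper(num_docs, num_splits):
    """Average documents per mapper, assuming an even distribution."""
    return num_docs / num_splits


def imbalance_ratio(records_per_mapper):
    """Largest mapper input divided by the mean; 1.0 means perfectly balanced."""
    mean = sum(records_per_mapper) / len(records_per_mapper)
    return max(records_per_mapper) / mean


# Figures from the quoted message: ~70MB file, 1MB blocks, ~600,000 documents.
splits = estimate_splits(70, 1)
print(splits)                                    # 70, matching the 70 mappers reported
print(round(docs_per_mapper(600_000, splits)))   # about 8571 documents per mapper
```

If the real per-mapper record counts give an imbalance_ratio well above 1.0, that would be the kind of imbalance Andy is asking about.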
