mahout-user mailing list archives

From Benoit Mathieu
Subject LDA with custom vectors
Date Mon, 04 Mar 2013 12:00:15 GMT
Hi mahout users,

I'd like to run the mahout Latent Dirichlet Allocation algorithm (mahout
cvb) on my own data. I have about 1M "documents" and a vocabulary of 30k
"terms". Documents are very sparse: each of them contains only 100 terms.
I'd like to extract "topics" from that.

I have generated Mahout vectors from my data with a simple Java program
that uses RandomAccessSparseVector.

I successfully launched the "mahout cvb" job with num_topics=200, but
the job seems very slow: 70 running map tasks took 10 minutes to process
about 25000 documents on my cluster.
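
To put that in perspective, here is a rough back-of-the-envelope estimate in
Java. It assumes the observed throughput stays constant and scales linearly to
the full data set, which may not hold in practice. The class and method names
are only illustrative and are not part of Mahout.

```java
public class CvbThroughputEstimate {

    // Overall throughput in documents per minute, from one observation.
    public static double docsPerMinute(long docsProcessed, double minutesElapsed) {
        return docsProcessed / minutesElapsed;
    }

    // Estimated minutes for one full pass over totalDocs,
    // assuming throughput stays constant (an assumption, not a measurement).
    public static double minutesPerPass(long totalDocs, long docsProcessed, double minutesElapsed) {
        return totalDocs / docsPerMinute(docsProcessed, minutesElapsed);
    }

    public static void main(String[] args) {
        long totalDocs = 1_000_000L;   // about 1M documents
        long docsProcessed = 25_000L;  // about 25000 documents observed
        double minutesElapsed = 10.0;  // in 10 minutes
        int mapTasks = 70;             // running map tasks

        double rate = docsPerMinute(docsProcessed, minutesElapsed);
        double passMinutes = minutesPerPass(totalDocs, docsProcessed, minutesElapsed);

        System.out.println("docs per minute (all tasks): " + rate);
        System.out.println("docs per minute per map task: " + rate / mapTasks);
        System.out.println("estimated minutes per pass: " + passMinutes);
    }
}
```

With these numbers the cluster handles 2500 documents per minute, so one pass
over 1M documents would take about 400 minutes, and cvb is iterative, so the
total would be a multiple of that.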

So my questions are:
- Does this job require a specific Vector class for good performance?
- Is the LDA algorithm suitable for processing 1M docs with a dictionary of
30k terms?

Thanks for any insights.

