mahout-user mailing list archives

From Benoit Mathieu ...@deezer.com>
Subject Re: LDA with custom vectors
Date Mon, 04 Mar 2013 16:40:53 GMT
Here is my command line:

mahout cvb --input user_model/vectors --output user_model/output \
  --num_topics 200 --num_terms 28892 --dictionary user_model/dictionary \
  --maxIter 10
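
For scale, here is a rough back-of-envelope estimate based on the figures quoted below (about 25000 documents in 10 minutes, about 1M documents in total) and on --maxIter 10. It is only a sketch: it assumes the observed throughput is the aggregate over all 70 map tasks, that it stays constant, and that the 10 minutes fall within a single iteration, none of which is confirmed in this thread. The class and method names are made up for illustration.

```java
public class CvbRuntimeEstimate {

    // Minutes for one full pass over the corpus, assuming the observed
    // throughput (docsObserved in minutesObserved) stays constant.
    static double minutesPerIteration(long totalDocs, long docsObserved, double minutesObserved) {
        return (double) totalDocs / docsObserved * minutesObserved;
    }

    // Total hours for maxIter passes at the same per-pass cost.
    static double totalHours(double minutesPerIteration, int maxIter) {
        return minutesPerIteration * maxIter / 60.0;
    }

    public static void main(String[] args) {
        long totalDocs = 1_000_000L;    // "about 1M documents" (from the thread)
        long docsObserved = 25_000L;    // "about 25000 documents" (from the thread)
        double minutesObserved = 10.0;  // "10mn" (from the thread)
        int maxIter = 10;               // --maxIter 10 (from the command line)

        double perIter = minutesPerIteration(totalDocs, docsObserved, minutesObserved);
        double hours = totalHours(perIter, maxIter);
        System.out.println("minutes per iteration: " + perIter);
        System.out.println("hours for " + maxIter + " iterations: " + hours);
    }
}
```

Under those assumptions one pass takes about 400 minutes, and 10 iterations take roughly 67 hours.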

benoit




2013/3/4 Jake Mannix <jake.mannix@gmail.com>

> Can you send us your command line args? Is that for 1 iteration? That
> would be very, very slow.
>
> On Monday, March 4, 2013, Benoit Mathieu wrote:
>
> > Hi mahout users,
> >
> > I'd like to run the mahout Latent Dirichlet Allocation algorithm (mahout
> > cvb) on my own data. I have about 1M "documents" and a vocabulary of 30k
> > "terms". Documents are very sparse: each of them contains only 100 terms.
> > I'd like to extract "topics" from that.
> >
> > I have generated Mahout vectors from my data with a simple Java program
> > that uses RandomAccessSparseVector.
> >
> > I successfully launched the "mahout cvb" job with num_topics=200, but
> > the job seems very slow: 70 running map tasks took 10 minutes to process
> > about 25000 documents on my cluster.
> >
> > So my questions are:
> > - Does this job require a specific Vector class for good performance?
> > - Is the LDA algorithm suitable for processing 1M docs with a dictionary
> > of 30k terms?
> >
> > Thanks for any insights.
> >
> > ++
> > benoit
> >
>
>
> --
>
>   -jake
>
