mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Mahout CVB parameters
Date Wed, 12 Jun 2013 17:57:20 GMT
Why does extracting concepts from documents require such a large K?




On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM COMPUTER
SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:

> Thanks, Jake, for your response.  I am trying to get concepts out of the
> documents, and for this I want K to be large, around 500.  I will run CVB
> based on your suggestions and see what I get.  I appreciate your prompt
> response.
>
> -Ankur
>
> -----Original Message-----
> From: Jake Mannix [mailto:jake.mannix@gmail.com]
> Sent: Wednesday, June 12, 2013 9:22 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout CVB parameters
>
> What is the number of terms in your dictionary, after tokenization and
> vectorization?  Typically, for English, you'll get reasonable results with
> anywhere from 20 to 200 topics, tending toward the lower end if you don't
> have very many documents (as in your case).  20 topics will yield very
> generic things, 100 is pretty nice a lot of the time, but 200 or more can
> lead to really niche things (I've found one topic that was basically all
> female first names, for example).
>
> Maximum # of iterations: I'd say that 20-30 is nearly always enough, but
> while it's running, it should be printing the perplexity as it goes (you
> can tell it to calculate this every N iterations, and set N to 1 to check
> after each iteration while you're seeing how it goes).  When the
> perplexity plateaus, you're done.  In practice, I've never needed more
> than 30 iterations (fewer, the larger your corpus is).
>
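The plateau check described above can be sketched as follows. This is a minimal illustration, not part of Mahout: the function name `plateau_iteration`, the `rel_tol` threshold, and the sample perplexity values are all assumptions made up for the example. It expects one perplexity value per iteration, copied by hand from the job's output.

```python
def plateau_iteration(perplexities, rel_tol=1e-3):
    """Return the 1-based iteration at which perplexity stops improving.

    `perplexities` holds one value per iteration, in order. An iteration
    counts as a plateau when the relative drop from the previous value is
    below `rel_tol` (a hypothetical threshold; tune it for your corpus).
    Returns None if no plateau is found.
    """
    for i in range(1, len(perplexities)):
        prev, curr = perplexities[i - 1], perplexities[i]
        # Perplexity is positive, so dividing by prev is safe.
        rel_change = (prev - curr) / prev
        if rel_change < rel_tol:
            return i + 1  # convert 0-based index to 1-based iteration
    return None


# Made-up values, only to show the call.
history = [1200.0, 900.0, 800.0, 799.9, 799.9]
stop_at = plateau_iteration(history)
```

With a threshold this loose, a single flat step ends the run; requiring two or three flat steps in a row would be a more cautious variant.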
> As for the smoothing parameters, we really should have an implementation
> of one of the various ways of finding them automatically, but for now,
> doing a grid search over values in the range of 0.001 to 0.1 while first
> testing things out tends to be helpful (so try every (alpha, beta) pair in
> {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1}).
>
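The grid suggested above has nine (alpha, beta) combinations. A minimal sketch of enumerating them and picking the best run by perplexity is shown here. The helper names `smoothing_grid` and `best_setting` are hypothetical, the perplexity numbers are invented, and launching the actual CVB jobs is left out, since that depends on your setup.

```python
from itertools import product

# Candidate values from the thread, used for both smoothing parameters.
SMOOTHING_VALUES = [0.001, 0.01, 0.1]


def smoothing_grid(values=SMOOTHING_VALUES):
    """Return every (alpha, beta) pair from the Cartesian product."""
    return list(product(values, repeat=2))


def best_setting(perplexity_by_setting):
    """Return the (alpha, beta) pair with the lowest recorded perplexity."""
    return min(perplexity_by_setting, key=perplexity_by_setting.get)


grid = smoothing_grid()  # 3 x 3 = 9 pairs to try

# Invented results for two of the runs, only to show the call.
results = {(0.001, 0.01): 810.0, (0.1, 0.1): 790.0}
winner = best_setting(results)
```

Comparing runs by perplexity is only meaningful when every run uses the same corpus, the same number of topics, and the same held-out data.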
> Hope that helps.
>
>
> On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM
> COMPUTER SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:
>
> > Hi,
> >
> > I am using Mahout CVB to generate topics from about 8K documents.  I
> > am struggling to determine the best parameter values to use.
> > Please help if you know the best way to determine parameter values
> > such as topic and term smoothing, max number of iterations, or total
> > number of topics to generate.
> >
> > Thanks,
> > Ankur
> >
>
>
>
> --
>
>   -jake
>
