Why does extracting concepts from documents require such a large K?
On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM COMPUTER
SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:
> Thanks, Jake, for your response. I am trying to get concepts out of the
> documents, and for this I want K to be large, around 500. I will run CVB
> based on your suggestions and see what I get. I appreciate your prompt
> response.
>
> -Ankur
>
> -----Original Message-----
> From: Jake Mannix [mailto:jake.mannix@gmail.com]
> Sent: Wednesday, June 12, 2013 9:22 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout CVB parameters
>
> What is the number of terms in your dictionary, after tokenization and
> vectorization? Typically, for English, you'll get reasonable topics with
> anywhere from 20 to 200 topics, tending toward the lower end if you don't
> have very many documents (as in your case). 20 topics will yield very
> generic things, 100 is pretty nice a lot of the time, but 200 or more can
> lead to really niche things (I've found things like getting one topic to
> be basically all female first names, for example).
>
> For the maximum number of iterations, I'd say that 20-30 tends to always
> be enough, but while you're running it, it should be spitting out the
> perplexity as it goes (you can tell it to calculate this every N
> iterations, and set N to 1 to check after each iteration while you're
> trying to see how it goes). When this perplexity plateaus, you're done.
> But in practice, I've never needed more than 30 iterations (fewer the
> larger your corpus is).
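
As a rough illustration of the "stop when perplexity plateaus" rule above,
here is a minimal Python sketch. It is not part of Mahout: the function
name, the 0.1% relative tolerance, and the sample readings are all
assumptions made up for the example. You would feed it the perplexity
values that CVB prints as it runs.

```python
def plateau_iteration(perplexities, rel_tol=0.001):
    """Return the 1-based index of the first reading that has plateaued.

    A reading counts as a plateau when its relative improvement over the
    previous reading is below rel_tol (an assumed threshold, not a Mahout
    default). Returns None if no plateau is seen in the readings given.
    """
    for i in range(1, len(perplexities)):
        prev, curr = perplexities[i - 1], perplexities[i]
        if prev <= 0:
            raise ValueError("perplexity readings must be positive")
        if (prev - curr) / prev < rel_tol:
            return i + 1
    return None


# Hypothetical readings, one per iteration, purely for illustration.
readings = [1500.0, 1200.0, 1100.0, 1090.0, 1089.5, 1089.4]
stop_at = plateau_iteration(readings)
```

If the perplexity is only calculated every N iterations, the returned index
counts readings, so multiply it by N to get the iteration number.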
>
> As for the smoothing parameters, we really should have an implementation
> of one of the various ways of finding them automatically, but for now,
> doing a grid search over values in the range of 0.001 to 0.1 while first
> testing things out tends to be helpful (so try (alpha, beta) =
> {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1}).
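
The 3 x 3 grid described above can be enumerated as in the sketch below.
This is only an illustration: `smoothing_grid`, `best_setting`, and the
`perplexity_of` callback are hypothetical names, and the callback stands
in for however you run CVB with a given (alpha, beta) pair and read back
its final perplexity.

```python
from itertools import product

# Candidate values for both smoothing parameters, taken from the range above.
SMOOTHING_VALUES = (0.001, 0.01, 0.1)


def smoothing_grid(values=SMOOTHING_VALUES):
    """Return every (alpha, beta) pair in the Cartesian product of values."""
    return list(product(values, repeat=2))


def best_setting(grid, perplexity_of):
    """Return the (alpha, beta) pair with the lowest perplexity.

    perplexity_of is a caller-supplied function (hypothetical here) that
    runs the model for one pair and returns the perplexity it reached.
    """
    return min(grid, key=lambda pair: perplexity_of(*pair))


grid = smoothing_grid()  # 9 pairs, from (0.001, 0.001) to (0.1, 0.1)
```

Lower perplexity is treated as better here, which is the usual convention.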
>
> Hope that helps.
>
>
> On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM
> COMPUTER SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:
>
> > Hi,
> >
> > I am using Mahout CVB to generate topics from about 8K documents. I
> > am struggling to determine the best parameter values to use.
> > Please help if you know the best way to determine parameter values
> > such as topic and term smoothing, the maximum number of iterations, or
> > the total number of topics to generate.
> >
> > Thanks,
> > Ankur
> >
>
>
>
> --
>
> -jake
>