mahout-user mailing list archives

From "Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco)" <ankur...@cisco.com>
Subject RE: Mahout CVB parameters
Date Wed, 12 Jun 2013 18:02:07 GMT
Hi Ted,

My assumption is that a single document usually contains a lot of concepts
(keywords/tags for the document), so across 8K documents you might find many
unique concepts.  We also did some analysis by manually going over about 100
documents and identified more than 50 concepts.

Thanks,
Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Wednesday, June 12, 2013 10:57 AM
To: user@mahout.apache.org
Subject: Re: Mahout CVB parameters

Why does extracting document concepts require such a large K?




On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED
at Cisco) <ankurdes@cisco.com> wrote:

> Thanks, Jake, for your response.  I am trying to get concepts out of the
> documents, and for this I want K to be large, around 500.  I will
> run CVB based on your suggestions and see what I get.  I appreciate your
> prompt response.
>
> -Ankur
>
> -----Original Message-----
> From: Jake Mannix [mailto:jake.mannix@gmail.com]
> Sent: Wednesday, June 12, 2013 9:22 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout CVB parameters
>
> What is the number of terms in your dictionary, after tokenization and
> vectorization?  Typically, for English, you'll get reasonable results
> with anywhere from 20-200 topics, tending toward the lower end if
> you've not got very many documents (like in your case).  20 topics will
> yield very generic things, 100 is pretty nice a lot of the time, but
> 200 or more can lead to really niche things (I've found things like
> getting one topic to be basically all female first names, for example).
>
> Maximum # of iterations: I'd say that 20-30 is almost always enough,
> but while you're running it, it should be spitting out the perplexity
> as it goes (you can tell it to calculate this every N iterations, and
> set N to 1 to check after each iteration while you're trying to see how
> it goes).  When this perplexity plateaus, you're done.  But in practice,
> I've never needed more than 30 iterations (fewer the larger your corpus is).
>
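The plateau check described above can be sketched in Python. This is only an
illustration under stated assumptions: the function name `has_plateaued`, the
1% relative tolerance, and the two-iteration window are all hypothetical
choices, not anything Mahout provides. The perplexity values are assumed to
have been collected from the CVB log output, one per evaluated iteration.

```python
def has_plateaued(perplexities, rel_tol=0.01, window=2):
    """Return True when each of the last `window` steps improved
    perplexity by less than `rel_tol` (relative to the previous value).

    `perplexities` is a list of positive floats, oldest first.
    `rel_tol` and `window` are illustrative defaults (assumptions).
    """
    if len(perplexities) < window + 1:
        return False  # not enough history to judge yet

    recent = perplexities[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        if prev <= 0:
            raise ValueError("perplexity values must be positive")
        # Lower perplexity is better; an increase counts as no improvement.
        if (prev - curr) / prev >= rel_tol:
            return False
    return True


if __name__ == "__main__":
    # Made-up numbers, purely to show the call shape.
    history = [1000.0, 800.0, 700.0, 698.0, 697.0]
    print(has_plateaued(history))
```

A stricter tolerance or a longer window makes the check more conservative, at
the cost of running a few extra iterations.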
> As for the smoothing parameters, we really should have an
> implementation of one of the various ways of finding them automatically,
> but for now, doing a grid search over values in the range of 0.001 to
> 0.1 while first testing things out tends to be helpful (so try (alpha,
> beta) in {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1}).
>
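The 3 x 3 grid suggested above can be enumerated with a short Python sketch.
Everything named here is an assumption made for illustration: `smoothing_grid`
and `best_setting` are hypothetical helpers, and `evaluate` stands in for
whatever you use to run CVB with a given (alpha, beta) pair and read back the
final perplexity. Reading alpha as the document/topic smoothing and beta as
the topic/term smoothing follows the usual LDA convention and is also an
assumption.

```python
from itertools import product

# Candidate values from the suggested range 0.001 to 0.1.
SMOOTHING_VALUES = (0.001, 0.01, 0.1)


def smoothing_grid(values=SMOOTHING_VALUES):
    """Return every (alpha, beta) pair in values x values."""
    return list(product(values, repeat=2))


def best_setting(evaluate, grid=None):
    """Call evaluate(alpha, beta) for each pair and return the
    (alpha, beta, perplexity) triple with the lowest perplexity.

    `evaluate` is a hypothetical callable supplied by the caller,
    e.g. a wrapper that launches a CVB run and parses its output.
    """
    grid = smoothing_grid() if grid is None else grid
    scored = [(evaluate(alpha, beta), alpha, beta) for alpha, beta in grid]
    perplexity, alpha, beta = min(scored)
    return alpha, beta, perplexity


if __name__ == "__main__":
    for alpha, beta in smoothing_grid():
        print(f"alpha={alpha} beta={beta}")
```

Nine runs is small enough to do exhaustively on a corpus of this size; a finer
grid can be tried around whichever pair scores best.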
> Hope that helps.
>
>
> On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM 
> COMPUTER SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:
>
> > Hi,
> >
> > I am using Mahout CVB to generate topics from about 8K documents.  I
> > am struggling to determine the best parameter values to use.
> > Please help if you know the best way to determine parameter values
> > like topic and term smoothing, max number of iterations, or total
> > number of topics to generate.
> >
> > Thanks,
> > Ankur
> >
>
>
>
> --
>
>   -jake
>