mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Mahout CVB parameters
Date Wed, 12 Jun 2013 18:15:52 GMT
LDA is not going to easily capture 500 or so sensible topics from 8000
documents.  The number of topics it can usefully resolve is typically on
the order of a constant times a logarithmic function of the number of
unique terms in the corpus.  If you try it with 500 topics, I guarantee
you'll find very weird "topics" like ["bob", "dave", "fred", ...], ["blue",
"magenta", "orange", ... ], ["7am", "12:30", "4pm", "midnight", ... ],
["hi", "hello", "salutations", "greetings", "whattup", ...].

But go ahead and try it with 50, 100, 200, 300, 400, 500 topics, and see
what they look like.  I doubt you'll have too much use for the topics when
you get up past 200 or so.
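
A minimal sketch of enumerating that sweep, combined with the smoothing
grid suggested earlier in the thread (quoted below).  The dictionary keys
and the function name are hypothetical, chosen only for illustration; they
are not Mahout option names, and this only lists the runs rather than
launching them.

```python
from itertools import product

# Topic counts suggested in the message above.
NUM_TOPICS = [50, 100, 200, 300, 400, 500]
# Smoothing values from the grid quoted further down in the thread.
SMOOTHING = [0.001, 0.01, 0.1]


def sweep_configs(num_topics=NUM_TOPICS, smoothing=SMOOTHING):
    """Return one settings dict per (numTopics, alpha, beta) combination."""
    return [
        {"num_topics": k, "alpha": a, "beta": b}
        for k, a, b in product(num_topics, smoothing, smoothing)
    ]


configs = sweep_configs()
# 6 topic counts x 3 alphas x 3 betas
print(len(configs))
```

The full cross product is 54 runs, so in practice it may be cheaper to fix
the smoothing values at a small number of topics first and then sweep
numTopics.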

In general, there is a principled way to do this, where you look at the
held-out perplexity as a function of numTopics, and stop when it plateaus.
 The "D" in LDA means that this will happen at a much lower range than 500
or so.
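
As a rough sketch of that stopping rule: assuming you have already recorded
the held-out perplexity for each numTopics you tried, something like the
following picks the smallest value after which the curve flattens.  The
function name, the 1% relative tolerance, and the example numbers are all
assumptions for illustration, not measurements from any real corpus.

```python
def pick_num_topics(perplexity_by_k, rel_tol=0.01):
    """Return the smallest K for which moving to the next larger K
    improves held-out perplexity by less than rel_tol (relative)."""
    ks = sorted(perplexity_by_k)
    for k, next_k in zip(ks, ks[1:]):
        current = perplexity_by_k[k]
        improvement = (current - perplexity_by_k[next_k]) / current
        if improvement < rel_tol:
            return k
    # No plateau observed within the range that was tried.
    return ks[-1]


# Made-up perplexities, purely to show the call.
example = {50: 1200.0, 100: 1000.0, 200: 995.0, 300: 994.0}
chosen = pick_num_topics(example)
print(chosen)
```

With these made-up numbers the step from 100 to 200 topics improves
perplexity by only 0.5%, so the sketch settles on 100.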

If you want many more topics, you need a different prior, but that's way
out of scope of this thread if you're trying to do this "out of the box".


On Wed, Jun 12, 2013 at 11:02 AM, Ankur Desai -X (ankurdes - SATYAM
COMPUTER SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:

> Hi Ted,
>
> My assumption is that there are a lot of concepts (keywords/tags for the
> document) usually present in a single document, and in 8K documents you
> might find many unique concepts.  We have also done some analysis by
> manually going over about 100 documents and have identified more than 50
> concepts.
>
> Thanks,
> Ankur
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, June 12, 2013 10:57 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout CVB parameters
>
> Why does document concept require such a large K?
>
>
>
>
> On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM
> COMPUTER SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:
>
> > Thanks, Jake, for your response.  I am trying to get concepts out of
> > the documents, and for this I want K to be large, around 500.  I will
> > run CVB based on your suggestions and see what I get.  I appreciate
> > your prompt response.
> >
> > -Ankur
> >
> > -----Original Message-----
> > From: Jake Mannix [mailto:jake.mannix@gmail.com]
> > Sent: Wednesday, June 12, 2013 9:22 AM
> > To: user@mahout.apache.org
> > Subject: Re: Mahout CVB parameters
> >
> > What is the number of terms in your dictionary, after tokenization and
> > vectorization?  Typically, for English, you'll get reasonable topics
> > with anywhere from 20-200 topics, tending toward the lower end if
> > you've not got very many documents (like in your case).  20 topics will
> > yield very generic things, 100 is pretty nice a lot of the time, but
> > 200 or more can lead to really niche things (I've found things like
> > getting one topic to be basically all female first names, for example).
> >
> > For the maximum # of iterations, I'd say that 20-30 tends to always be
> > enough, but while you're running it, it should be spitting out the
> > perplexity as it goes (you can tell it to calculate this every N
> > iterations, and set N to 1 to check after each iteration, while you're
> > trying to see how it goes).  When this perplexity plateaus, you're done.
> > But in practice, I've never needed more than 30 iterations (fewer, the
> > larger your corpus is).
> >
> > As for the smoothing parameters, we really should have an
> > implementation of one of the various ways of finding it automatically,
> > but for now, doing a grid search over values in the range of 0.001 to
> > 0.1 while first testing things out tends to be helpful (so try (alpha,
> > beta) = {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1})
> >
> > Hope that helps.
> >
> >
> > On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM
> > COMPUTER SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:
> >
> > > Hi,
> > >
> > > I am using Mahout CVB to generate topics from about 8K documents.  I
> > > am struggling to determine the best parameter values to use.  Please
> > > help if you know the best way to determine parameter values like
> > > topic and term smoothing, max number of iterations, or total number
> > > of topics to generate.
> > >
> > > Thanks,
> > > Ankur
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>



-- 

  -jake
