mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Mahout CVB parameters
Date Wed, 12 Jun 2013 16:21:58 GMT
How many terms are in your dictionary after tokenization and
vectorization?  Typically, for English, you'll get reasonable results with
anywhere from 20 to 200 topics, tending toward the lower end if you don't
have very many documents (as in your case).  20 topics will yield very
generic things, 100 is pretty nice a lot of the time, but 200 or more can
lead to really niche things (I've seen one topic end up being basically all
female first names, for example).

Maximum # of iterations: I'd say that 20-30 is almost always enough, but
while it's running, it should be printing the perplexity as it goes (you
can tell it to calculate this every N iterations; set N to 1 to check after
each iteration while you're getting a feel for how it behaves).  When the
perplexity plateaus, you're done.  In practice, I've never needed more than
30 iterations (fewer the larger your corpus is).
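
To make "plateaus" concrete, here is a minimal sketch of one way to check
it, assuming you have collected the reported perplexity values into a list
yourself.  The function names, the 1% tolerance and the window of 3 are my
own illustrative choices, not anything Mahout provides, and the numbers in
the usage comment are made up.

```python
def has_plateaued(perplexities, rel_tol=0.01, window=3):
    """True if each of the last `window` changes is below rel_tol (relative).

    rel_tol and window are arbitrary illustrative defaults, not Mahout settings.
    """
    if len(perplexities) < window + 1:
        return False  # not enough readings yet to judge
    recent = perplexities[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        if abs(prev - curr) / prev >= rel_tol:
            return False
    return True


def first_plateau_iteration(perplexities, rel_tol=0.01, window=3):
    """1-based index of the first reading at which the plateau test passes."""
    for i in range(1, len(perplexities) + 1):
        if has_plateaued(perplexities[:i], rel_tol, window):
            return i
    return None  # never flattened out within the readings given


# Usage with made-up readings, one per iteration:
# first_plateau_iteration([1200.0, 900.0, 780.0, 720.0, 700.0, 698.0, 697.0, 696.5])
```

Requiring several small changes in a row, rather than just one, avoids
stopping on a single iteration that happened to move very little.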

As for the smoothing parameters, we really should have an implementation of
one of the various ways of finding them automatically, but for now, doing a
grid search over values in the range of 0.001 to 0.1 while first testing
things out tends to be helpful.  So try every (alpha, beta) pair, i.e.
(topic smoothing, term smoothing), from {0.001, 0.01, 0.1} x
{0.001, 0.01, 0.1}.
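
That grid is only nine runs.  A rough sketch of the loop, assuming you
compare the runs by their final perplexity (lower is better) and that you
supply your own `evaluate(alpha, beta)` function to run CVB and return that
number; `evaluate` and `grid_search` are hypothetical names for
illustration, not part of Mahout.

```python
from itertools import product


def grid_search(evaluate, values=(0.001, 0.01, 0.1)):
    """Score every (alpha, beta) pair and return the best one plus all scores.

    `evaluate` is a caller-supplied (hypothetical) function that trains with
    the given smoothing values and returns a score where lower is better,
    e.g. the final perplexity.
    """
    results = {}
    for alpha, beta in product(values, values):
        results[(alpha, beta)] = evaluate(alpha, beta)
    best = min(results, key=results.get)
    return best, results


# Usage, with a stand-in for a real training run:
# best, results = grid_search(lambda a, b: my_cvb_perplexity(a, b))
```

Keeping the full `results` dict around is handy, since how flat or steep
the scores are across the grid tells you whether it's worth refining
further around the best pair.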

Hope that helps.


On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM COMPUTER
SERVICES LIMITED at Cisco) <ankurdes@cisco.com> wrote:

> Hi,
>
> I am using mahout CVB to generate topics from about 8K documents.  I am
> struggling to determine what are some of the best parameters values to use?
>  Please help, if you know best way to determine the parameter values like
> topic and term smoothing, max number of iterations, or total number of
> topics to generate.
>
> Thanks,
> Ankur
>



-- 

  -jake
