mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: running lda on test dataset
Date Sun, 23 Sep 2012 04:19:27 GMT
On Sat, Sep 22, 2012 at 12:49 PM, chyi-kwei yau <chyikwei.yau@gmail.com> wrote:

> Hi,
> You should be able to run inference on a test data set.
> And use perplexity of the test set to measure the performance of your
> model.
>
> See the LDA paper here for the details:
> http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf


The current LDA implementation in Mahout has a command-line option:

  --test_set_percentage

to hold out some of your training data as a "test set", which is used to
measure held-out perplexity during training.  The command-line option:

  --iteration_block_size

sets the training to compute held-out perplexity after this many iterations
(so if you set this to 10, then held-out perplexity is only computed every
10 iterations over the input data).

The perplexity is logged to the console during training, and is also
persisted in sequence files parallel with the model files (in a directory
like $OUTPUT_DIR/perplexity-$ITERATION_NUMBER or something like that).

So this will tell you how well the model has converged, and how likely
your test data would be to have been generated by your model, if that is
a test you'd find useful.
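
For reference, here is a minimal sketch of what held-out perplexity
measures, written in Python rather than taken from the Mahout code.  The
function name and the input layout (per-document topic probabilities and
per-topic term probabilities as nested lists) are assumptions made for
illustration; Mahout's own computation may normalize differently.

```python
import math

def held_out_perplexity(docs, doc_topic, topic_word):
    """Perplexity of held-out documents under an LDA-style model.

    docs:        list of {term_id: count} dicts, one per held-out document
    doc_topic:   doc_topic[d][k] = p(topic k | document d)
    topic_word:  topic_word[k][w] = p(term w | topic k)
    (All three names are hypothetical, not Mahout data structures.)
    """
    log_likelihood = 0.0
    num_tokens = 0
    num_topics = len(topic_word)
    for d, counts in enumerate(docs):
        for w, count in counts.items():
            # p(w | d) = sum over topics of p(k | d) * p(w | k)
            p = sum(doc_topic[d][k] * topic_word[k][w]
                    for k in range(num_topics))
            log_likelihood += count * math.log(p)
            num_tokens += count
    # exp of the negative average per-token log-likelihood
    return math.exp(-log_likelihood / num_tokens)
```

Lower is better: a model that spreads probability uniformly over a
vocabulary of V terms scores a perplexity of V, so anything useful should
come in well below the vocabulary size.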



>
>
> Best,
> Chyi-Kwei
>
> On Sat, Sep 22, 2012 at 2:51 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> > What would you want a test to tell you?  LDA is unsupervised, so it'll
> > give you the word-topic probabilities, and for each test document (or
> > training document) you can get the document-topic probabilities as well.
> > Then... what would you like to know at that point?
> >
> > On Sat, Sep 22, 2012 at 10:00 AM, vineeth <vineethrakesh@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> I am searching for how to run Mahout LDA on a test data set to detect
> >> the topics. Is there a way to test the trained LDA model? Or should we
> >> write our own program based on the word-topic probabilities that the
> >> LDA spits out after running on the test data?
> >>
> >> Thanks
> >> Vineeth
> >>
> >
> >
> >
> > --
> >
> >   -jake
>



-- 

  -jake
