mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: LDA Convergence
Date Thu, 21 Feb 2013 17:00:42 GMT
I really can't read your results here; the formatting of your columns is
pretty destroyed. It looks like you've got results for 20 topics as well as
for 10, with different-sized corpora?

You can't compare convergence across corpus sizes: the perplexity will vary
by an order of magnitude between them.  The only comparison that matters is,
for a single fixed corpus, what the (held-out) perplexity looks like after
5, 10, 15, 20, ... iterations.  Does it start to level off?  At some point
you may start overfitting, and the perplexity will go back up.  Your
convergence happened before that.

I don't think I've ever needed to run more than 50 iterations, and usually
I stop after 20-30.  The bigger the corpus, the more this becomes true.
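
As a rough illustration of the "watch the held-out perplexity on one fixed
corpus" check described above, here is a minimal Python sketch. The function
name and the 0.5% relative-improvement threshold are assumptions made for
illustration, not anything Mahout provides; the sample values are two of the
perplexity series from the quoted message, each taken from a single corpus.

```python
def stopping_checkpoint(perplexities, rel_tol=0.005):
    """Return the index of the checkpoint at which to stop, or None.

    perplexities: held-out perplexity at successive checkpoints of ONE corpus.
    rel_tol: hypothetical threshold; stop once the relative improvement over
             the previous checkpoint falls below it.
    """
    for i in range(1, len(perplexities)):
        prev, curr = perplexities[i - 1], perplexities[i]
        if curr >= prev:
            # Perplexity went back up (possible overfitting):
            # the previous checkpoint was the best one seen.
            return i - 1
        if (prev - curr) / prev < rel_tol:
            # Still improving, but the curve has levelled off.
            return i
    return None  # still improving meaningfully; keep iterating


# 40,044 docs, 10 topics: the curve flattens at the last checkpoint.
levelling_off = [16.326, 15.418, 15.191, 15.088, 15.028]
# 40,044 docs, 20 topics: the last value is higher than the one before it.
going_back_up = [26.461, 24.517, 23.996, 23.805, 23.882]

print(stopping_checkpoint(levelling_off))
print(stopping_checkpoint(going_back_up))
```

Each series is checked separately, since the values are only comparable
within a single corpus and topic count. The threshold is a judgment call and
would need tuning for a real run.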


On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <
davidlabarbera@localresponse.com> wrote:

> I've been running some performance tests with the LDA algorithm and I'm
> unsure how to gauge them. I ran 10 iterations each time and collected the
> perplexity value every 2 iterations with the test fraction set to 0.1. These
> were all run on an AWS cluster with 10 nodes (70 mappers, 30 reducers). I'm
> not sure about the memory or CPU specs. I also stored the documents on HDFS
> in 1MB blocks to get some parallelization. The documents I have were very
> short - 10-100 words each.  Hopefully these results are clear.
>
> Document  Corpus     Topic  Perplexity                              Dictionary  Runtime
> Count     Size (MB)  Count                                          Size        (min/iteration)
> 40,044    3.2        10     16.326, 15.418, 15.191, 15.088, 15.028  14,097      1.5
> 40,044    3.2        20     26.461, 24.517, 23.996, 23.805, 23.882  14,097      6
> 40,044    3.2        40     19.722, 18.185, 17.823, 17.680, 17.608  14,097      11.5
> 40,046    3.7        10     19.286, 18.373, 18.092, 17.958, 17.865  98,283      5.5
> 40,046    3.7        20     18.574, 17.448, 17.143, 17.018, 16.940  98,283      10.5
> 44,767    4          10     19.928, 18.815, 18.521, 18.350, 18.225  31,727      2.5
> 44,767    4          20     21.838, 20.421, 20.087, 19.963, 19.903  31,727      4.5
> 616,957   58.5       10     14.467, 13.830, 13.583, 13.435, 13.381  151,807     8.5
> 616,957   58.5       20     13.590, 12.787, 12.605, 12.522, 12.476  151,807     16
> 616,972   58.4       10     14.646, 13.904, 13.646, 13.573, 13.543  54,280      4
> 616,967   54.1       10     13.363, 12.634, 12.432, 12.345, 12.283  32,101      2.5
> 616,967   54.1       20     13.195, 12.307, 12.065, 11.764, 11.732  32,101      4.5
>
> The question is how to interpret the results. In particular, is there
> anything telling me when to stop running LDA? I've tried running until
> convergence, but I've never had the patience to see it finish. Does the
> perplexity give some hint as to the quality of the results? In attempting to
> reach convergence, I saw runs going to 200 iterations. If an iteration
> takes around 5.5 minutes, that's 18 hours of processing - and that doesn't
> include overhead between iterations.
>
> David




-- 

  -jake
