mahout-user mailing list archives

From David LaBarbera <davidlabarb...@localresponse.com>
Subject LDA Convergence
Date Thu, 21 Feb 2013 14:45:24 GMT
I've been running some performance tests with the LDA algorithm and I'm unsure how to gauge
them. I ran 10 iterations each time and collected the perplexity value every 2 iterations
with the test fraction set to 0.1. These were all run on an AWS cluster with 10 nodes (70 mappers,
30 reducers). I'm not sure about the memory or CPU specs. I also stored the documents on HDFS
in 1MB blocks to get some parallelization. The documents are very short, 10-100 words
each. Hopefully these results are clear.

Document Count | Corpus Size (MB) | Topic Count | Perplexity (every 2 iterations)        | Dictionary Size | Runtime (min/iteration)
---------------+------------------+-------------+----------------------------------------+-----------------+------------------------
40,044         | 3.2              | 10          | 16.326, 15.418, 15.191, 15.088, 15.028 | 14,097          | 1.5
40,044         | 3.2              | 20          | 26.461, 24.517, 23.996, 23.805, 23.882 | 14,097          | 6
40,044         | 3.2              | 40          | 19.722, 18.185, 17.823, 17.680, 17.608 | 14,097          | 11.5
40,046         | 3.7              | 10          | 19.286, 18.373, 18.092, 17.958, 17.865 | 98,283          | 5.5
40,046         | 3.7              | 20          | 18.574, 17.448, 17.143, 17.018, 16.940 | 98,283          | 10.5
44,767         | 4                | 10          | 19.928, 18.815, 18.521, 18.350, 18.225 | 31,727          | 2.5
44,767         | 4                | 20          | 21.838, 20.421, 20.087, 19.963, 19.903 | 31,727          | 4.5
616,957        | 58.5             | 10          | 14.467, 13.830, 13.583, 13.435, 13.381 | 151,807         | 8.5
616,957        | 58.5             | 20          | 13.590, 12.787, 12.605, 12.522, 12.476 | 151,807         | 16
616,972        | 58.4             | 10          | 14.646, 13.904, 13.646, 13.573, 13.543 | 54,280          | 4
616,967        | 54.1             | 10          | 13.363, 12.634, 12.432, 12.345, 12.283 | 32,101          | 2.5
616,967        | 54.1             | 20          | 13.195, 12.307, 12.065, 11.764, 11.732 | 32,101          | 4.5

The question is how to interpret the results. In particular, is there anything telling me
when to stop running LDA? I've tried running until convergence, but I've never had the patience
to see it finish. Does the perplexity give some hint as to the quality of the results? In attempting
to reach convergence, I saw runs going to 200 iterations. If an iteration takes around 5.5
minutes, that's about 18 hours of processing, and that doesn't include overhead between iterations.
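
To make the question concrete, here is a minimal sketch of one possible stopping heuristic: stop once the relative drop in perplexity between consecutive checkpoints falls below a tolerance. The function names and the 0.5% tolerance are assumptions chosen for illustration, not anything Mahout provides or recommends. The sample values are the first row of the table, and the last line reproduces the 200-iteration runtime estimate.

```python
def relative_changes(perplexities):
    """Fractional decrease between consecutive perplexity checkpoints."""
    return [(prev - cur) / prev
            for prev, cur in zip(perplexities, perplexities[1:])]


def first_flat_checkpoint(perplexities, tolerance=0.005):
    """Index of the first checkpoint that improves on the previous one
    by less than `tolerance` (an assumed threshold), or None if every
    checkpoint still improves by at least that much. A checkpoint where
    perplexity went up has a negative change and also counts as flat."""
    for i, change in enumerate(relative_changes(perplexities), start=1):
        if change < tolerance:
            return i
    return None


# Perplexity checkpoints from the first table row (10 topics, 40,044 documents).
row = [16.326, 15.418, 15.191, 15.088, 15.028]
changes = relative_changes(row)
stop_at = first_flat_checkpoint(row)

# Rough cost of a 200-iteration run at 5.5 minutes per iteration,
# ignoring overhead between iterations.
hours = 200 * 5.5 / 60

print([round(c, 4) for c in changes])
print(stop_at, round(hours, 1))
```

With these numbers the improvement shrinks at every checkpoint, and only the last one falls under the assumed 0.5% tolerance. A looser or tighter tolerance moves the stopping point, so the threshold is a judgment call rather than a rule.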

David