mahout-user mailing list archives

From Simon Handley <shand...@alumni.stanford.org>
Subject LDA, printing Topics
Date Fri, 18 May 2012 15:10:04 GMT
I'm trying to understand how LDA prints out the words per topic.  If I run
the reuters example, the topics are printed out like this:

Topic 0
===========
dlrs [p(dlrs|topic_0) = 0.09982075792238235
mln [p(mln|topic_0) = 0.05160370562850524
its [p(its|topic_0) = 0.026424106119119467
earnings [p(earnings|topic_0) = 0.01443840106489682
first [p(first|topic_0) = 0.009974557469871507
expects [p(expects|topic_0) = 0.009142468152407305
from [p(from|topic_0) = 0.008524695847985245
fiscal [p(fiscal|topic_0) = 0.004258375295028562
fourth [p(fourth|topic_0) = 0.0030771424078847786
dome [p(dome|topic_0) = 0.002437140573596841
full [p(full|topic_0) = 0.002003535406532566
future [p(future|topic_0) = 0.0015052377174063838


 [...]

If I look in LDAPrintTopics I see that it keeps a PriorityQueue of
Pair<String,Double> (word, LL) for each topic.  What I don't understand is
that the priority queue uses the "natural ordering" of Pair, so the word is
the primary sort key and the log-likelihood is only the secondary key.  It
seems to me that the queue should be ordered by LL, since the goal is to
find the top 20 words per topic by LL.  For example, if I change line 67
of LDAPrintTopics.java from

queues.add(new PriorityQueue<Pair<String,Double>>());


to

// Order the queue by score (the second element) instead of by word.
queues.add(new PriorityQueue<Pair<String,Double>>(10,
    new Comparator<Pair<String,Double>>() {
      @Override
      public int compare(Pair<String, Double> arg0, Pair<String, Double> arg1) {
        return Double.compare(arg0.getSecond(), arg1.getSecond());
      }
    }));

then Topic 0 now looks like

Topic 0
===========
dlrs [p(dlrs|topic_0) = 0.09982075792238235
said [p(said|topic_0) = 0.05181051933987995
mln [p(mln|topic_0) = 0.05160370562850524
company [p(company|topic_0) = 0.03218582414212703
its [p(its|topic_0) = 0.026424106119119467
share [p(share|topic_0) = 0.02381971077608159
quarter [p(quarter|topic_0) = 0.01598458767052625
about [p(about|topic_0) = 0.014775061957039969
earnings [p(earnings|topic_0) = 0.01443840106489682
per [p(per|topic_0) = 0.011622847792849595
1987 [p(1987|topic_0) = 0.011292459086965701
first [p(first|topic_0) = 0.009974557469871507
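
The difference between the two orderings can be reproduced outside Mahout
with a minimal, self-contained sketch.  This is not the LDAPrintTopics code:
WordScore is a hypothetical stand-in for Pair<String,Double> that I assume
compares by word first and score second (as described above), and topN is a
hypothetical helper that uses a generic "evict the queue head when full"
rule, which may differ from what LDAPrintTopics actually does.  The scores
are rounded values from the Topic 0 listing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopWordsSketch {

    /** Hypothetical stand-in for Pair<String,Double>: orders by word, then by score. */
    public static final class WordScore implements Comparable<WordScore> {
        final String word;
        final double score;

        WordScore(String word, double score) {
            this.word = word;
            this.score = score;
        }

        @Override
        public int compareTo(WordScore other) {
            int byWord = word.compareTo(other.word);
            return byWord != 0 ? byWord : Double.compare(score, other.score);
        }
    }

    /** Orders entries by score only, like the comparator proposed in the message. */
    public static final Comparator<WordScore> BY_SCORE =
            Comparator.comparingDouble(ws -> ws.score);

    /**
     * Keeps at most n entries in a PriorityQueue ordered by cmp. On overflow the
     * head (the least element under cmp) is evicted. The survivors are returned
     * as words, sorted by descending score for display.
     */
    public static List<String> topN(Map<String, Double> scores, int n,
                                    Comparator<WordScore> cmp) {
        PriorityQueue<WordScore> queue = new PriorityQueue<>(n, cmp);
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            queue.add(new WordScore(e.getKey(), e.getValue()));
            if (queue.size() > n) {
                queue.poll(); // drops the least element under cmp
            }
        }
        List<WordScore> kept = new ArrayList<>(queue);
        kept.sort(BY_SCORE.reversed());
        List<String> words = new ArrayList<>();
        for (WordScore ws : kept) {
            words.add(ws.word);
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("dlrs", 0.0998);
        scores.put("said", 0.0518);
        scores.put("mln", 0.0516);
        scores.put("company", 0.0322);
        scores.put("its", 0.0264);
        scores.put("future", 0.0015);

        // Natural ordering evicts the alphabetically smallest word, whatever its score.
        System.out.println(topN(scores, 3, Comparator.naturalOrder())); // prints [said, mln, its]
        // Score ordering evicts the lowest-scoring word, leaving the true top 3.
        System.out.println(topN(scores, 3, BY_SCORE)); // prints [dlrs, said, mln]
    }
}
```

With the natural ordering, the queue head is the alphabetically first word,
so eviction has nothing to do with the score; in this sketch it even drops
"dlrs", the highest-scoring word.  With the score comparator, the head is
always the weakest candidate, which is what a top-N selection needs.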

Ordering the queue by score makes more sense to me, since it surfaces the
highest-probability words.  Am I missing something?  Thanks,

Simon
