mahout-user mailing list archives

From Jack Pay
Subject Re: Interpreting the results of LDA CVB
Date Thu, 31 Jan 2013 14:54:49 GMT
The bug I found causes the document-topic model to be trained on a random matrix instead of
the final (term | topic) probability distributions. Unless a fix has been released, this
happens in every run. At least it does for me.
The result is a random document-topic model with more or less uniform distributions.
The term-topic model works fine.
As far as I can see, this should affect everyone using the Hadoop distributed version.

This looks like output from the (topic | document) distribution, since the vectors are
of size 10 and there are 10 topics, with the dictionary applied (which you should not do),
not the (term | topic) distribution.
It will therefore be close to uniform because of the bug.
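
To illustrate the point about the dictionary, here is a minimal Python sketch. It is not Mahout code: the dictionary entries, the 4-topic vector, and all function names are made up for illustration. It shows how the indices of a (topic | document) vector get mislabeled as terms when a term dictionary is applied, and one simple way to check how close such a vector is to uniform.

```python
# Hypothetical stand-in for the first few entries of a term dictionary
# (term id -> term). Real dictionaries are far larger.
term_dictionary = {0: '"Fluxgate', 1: "&", 2: "(", 3: "(Babak"}

# Made-up (topic | document) vector for one document, assuming 4 topics.
doc_topic = [0.27, 0.24, 0.26, 0.23]


def label_with_dictionary(vector, dictionary):
    """The mistake: topic indices are looked up as if they were term ids."""
    return {dictionary[i]: p for i, p in enumerate(vector)}


def label_with_topics(vector):
    """The intended reading: index i is topic i, not term i."""
    return {f"topic_{i}": p for i, p in enumerate(vector)}


def max_deviation_from_uniform(vector):
    """Largest absolute gap between any entry and 1/len(vector)."""
    uniform = 1.0 / len(vector)
    return max(abs(p - uniform) for p in vector)


if __name__ == "__main__":
    print(label_with_dictionary(doc_topic, term_dictionary))
    print(label_with_topics(doc_topic))
    print(max_deviation_from_uniform(doc_topic))
```

A small maximum deviation across many documents would be consistent with the near-uniform document-topic output described above, though it does not by itself prove the cause.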

I hope to post a patch by the end of today, as I am working on it now.


On 31 Jan 2013, at 14:37, Jake Mannix wrote:

> Hi Thilina,
>  The flag you missed on your vectordump commandline is the "--sort"
> option, which sorts the results before taking the top k.  Try that and send
> us what that looks like?  It should be much easier to interpret.
> On Mon, Jan 7, 2013 at 7:19 AM, Thilina Gunarathne wrote:
>> Dear All,
>> I'm trying to run the Mahout LDA (CVB version) on a subset of the 20news
>> data set, as a sample for a Hadoop publication we are working on.  I need
>> some help in understanding the Mahout output to figure out the topics.
>> I ran the following commands on the TF vectors generated using the
>> seq2sparse command.
>>> bin/mahout rowid -i 20news-tf/tf-vectors -o 20news-tf-int
>>> bin/mahout cvb -i 20news-tf-int/matrix -o lda-out -k 10 -x 20 -dict 20news-tf/dictionary.file-0 -dt lda-topics -mt lda-topic-model
>> After that I dumped the results using the vectordump as follows.
>>> bin/mahout vectordump -i lda-topics/part-m-00000 --dictionary 20news-tf/dictionary.file-0 --vectorSize 10 -dt sequencefile
>> ......
>> {"Fluxgate:0.12492744375758073,&:0.03875953927132082,(,(Babak:0.15074522974495433,(Bill:0.10512715697420276,(Gerrit:0.10130565323653766,(Michael:0.061169131590630275,(Scott:0.14501579630233746,(Usenet:0.07872957132697946,(continued):0.07135655272850545}
>> {"Fluxgate:0.13130952097888746,&:0.05207587369196414,(,(Babak:0.08607740024552457,(Bill:0.20218284543514245,(Gerrit:0.07318295757631627,(Michael:0.08766888242201039,(Scott:0.08858421220476514,(Usenet:0.09201906604666685,(continued):0.06156698532477829}
>> .......
>> It would be great if someone can help me to interpret the above results.
>> The probability values seem to be more or less similar in all the cases.
>> Is it due to the smaller size of the dataset?
>> thanks,
>> Thilina
>> --
> -- 
>  -jake
