The bug I found causes the document-topic model to be trained on a random matrix rather than
on the final term-topic probability distributions. As far as I can see, this affects everyone
using the Hadoop distributed version, unless a bug fix has been released. It certainly happens
in every case for me.
The result is a random document-topic model with more or less uniform distributions.
The term-topic model works fine.
This looks like the output from the topic-document distribution (the vectors are of size 10
and there are 10 topics) with the dictionary applied (which you should not do, since the
indices there are topic ids rather than term ids), not the term-topic distribution.
It will therefore be close to uniform because of the bug.
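As a rough sanity check of that reading, here is a minimal Python sketch. It takes the ten values from the first vector in the quoted vectordump output below, confirms they sum to 1 (so each vector is a probability distribution over 10 entries, one per topic), and computes a normalized entropy. The helper name normalized_entropy is my own and is not part of Mahout, and treating a ratio near 1 as "more or less uniform" is an assumption for illustration, not a formal test.

```python
import math

# Values copied from the first vector in the quoted vectordump output.
row = [
    0.12492744375758073, 0.03875953927132082, 0.1228639250669511,
    0.15074522974495433, 0.10512715697420276, 0.10130565323653766,
    0.061169131590630275, 0.14501579630233746, 0.07872957132697946,
    0.07135655272850545,
]


def normalized_entropy(p):
    """Shannon entropy of p divided by log(len(p)).

    Hypothetical helper for illustration: 1.0 means perfectly uniform,
    values near 0 mean the mass sits on very few entries.
    """
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))


# Ten entries that sum to 1: one probability per topic, not per term.
print("entries:", len(row))
print("sum:", round(sum(row), 6))
print("normalized entropy:", round(normalized_entropy(row), 3))
```

A normalized entropy close to 1 is what a nearly uniform distribution looks like. That is consistent with the bug, though on its own it does not prove it.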
I am working on a patch now and hope to post it by the end of today.
Jack
On 31 Jan 2013, at 14:37, Jake Mannix wrote:
> Hi Thilina,
> The flag you missed on your vectordump command line is the "sort"
> option, which sorts the results before taking the top k. Could you try that
> and send us the output? It should be much easier to interpret.
> On Mon, Jan 7, 2013 at 7:19 AM, Thilina Gunarathne <csethil@gmail.com> wrote:
>
>> Dear All,
>> I'm trying to run the Mahout LDA (cvb version) on a subset of the 20news
>> data set, as a sample for a Hadoop publication we are working on. I need
>> some help in understanding the Mahout output to figure out the topics.
>>
>> I ran the following commands on the TF vectors generated using the
>> seq2sparse command.
>>> bin/mahout rowid -i 20newstf/tfvectors -o 20newstfint
>>> bin/mahout cvb -i 20newstfint/matrix -o ldaout -k 10 -x 20 -dict
>> 20newstf/dictionary.file0 -dt ldatopics -mt ldatopicmodel
>>
>> After that I dumped the results using vectordump as follows.
>>
>>> bin/mahout vectordump -i ldatopics/partm00000 --dictionary
>> 20newstf/dictionary.file0 --vectorSize 10 -dt sequencefile
>>
>>
>> {"Fluxgate:0.12492744375758073,&:0.03875953927132082,(140.220.1.1):0.1228639250669511,(Babak:0.15074522974495433,(Bill:0.10512715697420276,(Gerrit:0.10130565323653766,(Michael:0.061169131590630275,(Scott:0.14501579630233746,(Usenet:0.07872957132697946,(continued):0.07135655272850545}
>>
>> {"Fluxgate:0.13130952097888746,&:0.05207587369196414,(140.220.1.1):0.12533225607394424,(Babak:0.08607740024552457,(Bill:0.20218284543514245,(Gerrit:0.07318295757631627,(Michael:0.08766888242201039,(Scott:0.08858421220476514,(Usenet:0.09201906604666685,(continued):0.06156698532477829}
>> It would be great if someone could help me interpret the above results.
>> The probability values seem to be more or less similar in all the cases.
>> Is it due to the smaller size of the dataset?
>>
>> thanks,
>> Thilina
>>
