mahout-user mailing list archives

From Mohammed Omer <beancinemat...@gmail.com>
Subject Re: Difficulties mapping results of CVB/LDA back to corresponding vector keys
Date Fri, 25 Apr 2014 13:39:35 GMT
Suneel,

I ended up using seqdumper on the docIndex file to retrieve the mapping of
rowid -> text key. This brought me a lot closer than where I was before!
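
For reference, the rowid -> text key mapping can be loaded from a seqdumper
text dump with a short script. This is only a minimal sketch in Python: it
assumes each record is printed as `Key: <rowid>: Value: <text key>`, which
should be checked against the actual dump. `parse_doc_index` is a
hypothetical helper name and the sample keys are invented.

```python
import re

# Assumed seqdumper record layout: "Key: <int rowid>: Value: <original text key>"
RECORD = re.compile(r"^Key: (\d+): Value: (.*)$")


def parse_doc_index(lines):
    """Return {rowid: text_key} from the lines of a docIndex text dump."""
    mapping = {}
    for line in lines:
        match = RECORD.match(line.strip())
        if match:  # non-record lines (headers, counts) are skipped
            mapping[int(match.group(1))] = match.group(2)
    return mapping


# Invented sample lines, for illustration only
sample = [
    "Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.hadoop.io.Text",
    "Key: 0: Value: com.example:http/a.html",
    "Key: 1: Value: com.example:http/b.html",
    "Count: 2",
]
doc_index = parse_doc_index(sample)
```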

However, I now have three separate files (contents here:
https://gist.github.com/momer/11289002).

My first thought is to write my own map/reduce jobs to get a dataset
(key/value) which has the format: original_text_key => [term1, term2,
term3, term4]

Where the terms are selected from the topic which has the highest
probability of describing the document.

An example:

"com.soryy:http/ruby/api/rails/authentication/2014/03/16/apis-with-devise.html"
=> ["end","token","authentication","api","resource","def","devise","json","user","x"]
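
The join described above can be sketched on in-memory data before committing
to map/reduce. This is a rough Python illustration under stated assumptions,
not Mahout functionality: the three dictionaries stand in for the three dumped
files, the function name `top_terms_per_doc` is hypothetical, and the toy
values are invented. It assumes the per-document dump holds one probability
per topic and the per-topic dump holds one weight per term.

```python
def top_terms_per_doc(doc_index, doc_topics, topic_terms, n=10):
    """Map each original text key to the top-n terms of its most probable topic.

    doc_index:   {rowid: text_key}
    doc_topics:  {rowid: [probability of topic 0, probability of topic 1, ...]}
    topic_terms: {topic_id: {term: weight of the term within that topic}}
    """
    result = {}
    for rowid, probs in doc_topics.items():
        # Index of the topic with the highest probability for this document
        best_topic = max(range(len(probs)), key=probs.__getitem__)
        terms = topic_terms[best_topic]
        # Terms of that topic, highest weight first, truncated to n
        ranked = sorted(terms, key=terms.get, reverse=True)[:n]
        result[doc_index[rowid]] = ranked
    return result


# Toy data, invented for illustration only
doc_index = {0: "doc-a", 1: "doc-b"}
doc_topics = {0: [0.9, 0.1], 1: [0.2, 0.8]}
topic_terms = {
    0: {"token": 0.5, "api": 0.3, "user": 0.2},
    1: {"index": 0.6, "query": 0.4},
}
result = top_terms_per_doc(doc_index, doc_topics, topic_terms, n=2)
```

If the dumps fit in memory this may be all that is needed; the same three-way
join is what a map/reduce version would have to reproduce at scale.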

Is there built-in functionality to do this, or is my plan of running
another 1 or 2 map-reduce jobs the way to go?

Thank you for your help again,

Mo



On Thu, Apr 24, 2014 at 6:52 PM, Suneel Marthi <smarthi@apache.org> wrote:

> RowId creates a matrix and a docIndex, which are <IntWritable, VectorWritable>
> and <IntWritable, Text> respectively.
>
> Have you looked at LDAPrintTopics.java?
>
>
> On Thu, Apr 24, 2014 at 7:32 PM, Mohammed Omer <beancinematics@gmail.com> wrote:
>
>> Good evening all.
>>
>> This is my first time working with Mahout, and I'm really excited about
>> being able to stand on the shoulders of giants, thanks to your hard work
>> on the project.
>>
>> I'm 90% of the way there with my current Mahout project, but that last 10%
>> is killing me.
>>
>> Code is at https://github.com/momer/mahout_difficulties if you want to
>> skip my explanation and go right to the commands I ran, etc.
>>
>> Using a Lucene index and Mahout's robust CLI, I was able to generate
>> sequence files, build sparse vectors, convert those vector keys to
>> integers, and then run the CVB/LDA algorithm.
>>
>> This worked great, and I was able to dump out the p(topic|doc) and
>> p(term|topic) results, but I'm having a tough time figuring out how to
>> use the matrix generated by `mahout rowid` to map the documents and their
>> respective topic assignments/probabilities back to their original text
>> vector keys.
>>
>> Though I'm typically a Rubyist, I recently (last weekend) read and worked
>> through the entirety of Core Java vol 1, so I'm pretty comfortable with
>> Java. I am falling on my face at this last step, though.
>>
>> I appreciate the eyes and help!
>>
>> Thank you again,
>>
>> Mo
>>
>
>
