mahout-user mailing list archives

From "Runkel, Timothy J" <timothy.j.run...@lmco.com>
Subject Using LDA CVB results to match a new document to topics?
Date Wed, 30 May 2012 00:03:53 GMT
Jake,



We've run the new LDA CVB implementation (using the RowID job to format the docs, as you
noted in another email) and have complete results.



Now, given the topic-by-term association vectors, how can we take a new document (in term
frequency format, using the same dictionary as the trained documents and ignoring the few
terms not found in it) and query the model to rank its top topic matches?  Academic papers
seem to gloss over this task.



An input TF vector is not the same thing as a topic-by-term association vector, but hoping
the two were analogous enough, I tried several similarity and distance measures between the
TF vectors of some trained docs and the topic-by-term association vectors.  The top topic
matches calculated this way did not rank in even approximately the same order as the
doc-by-topic membership rankings in the model output.  If the approach cannot reproduce the
rankings for trained docs, it seems even less likely to match a new doc TF vector to model
topics.
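
To make the comparison concrete, here is a minimal sketch of that kind of ranking. Cosine
similarity stands in as one representative measure, and the class and method names
(CosineRankSketch, rankTopics) are hypothetical; this is plain Java for illustration, not
Mahout code.

```java
import java.util.Arrays;

public class CosineRankSketch {

    /** Cosine similarity between two equal-length vectors; 0 if either is all zeros. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /**
     * Returns topic indices ordered from most to least similar to the document.
     * docTf[w] is the count of dictionary term w; topicTerm[k][w] is the weight
     * of term w in topic k (assumed layout, one row per topic).
     */
    public static Integer[] rankTopics(double[] docTf, double[][] topicTerm) {
        final double[] scores = new double[topicTerm.length];
        Integer[] order = new Integer[topicTerm.length];
        for (int k = 0; k < topicTerm.length; k++) {
            scores[k] = cosine(docTf, topicTerm[k]);
            order[k] = k;
        }
        // Sort descending by similarity score.
        Arrays.sort(order, (x, y) -> Double.compare(scores[y], scores[x]));
        return order;
    }
}
```

A ranking like this compares raw term counts to each topic independently, so it has no notion
of topics competing to explain the same terms, which may be one reason the orderings differ.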



Tracing the logic through the CVB classes seems to show that the final model training
iteration happens in the
TopicModel.trainDocTopicModel(Vector original, Vector topics, Matrix docTopicModel)
method, but several of its steps use class values that are not obviously accessible in the
model results, and its modifications to the docTopicModel matrix seem more like tuning than
a simple lookup.
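
One possible reading of that kind of per-document step is a "fold-in": hold the trained
topic-by-term weights fixed and iteratively re-estimate only the new document's topic
mixture. The sketch below illustrates that general idea with a plain EM-style update. It is
an assumption about the approach rather than Mahout's implementation, it omits the smoothing
priors an LDA model would normally apply, and the names (FoldInSketch, inferTopics) are
hypothetical.

```java
import java.util.Arrays;

public class FoldInSketch {

    /**
     * Estimates p(topic | doc) for one unseen document with the topics held fixed.
     *
     * @param termCounts term frequencies indexed by dictionary id
     * @param topicTerm  topicTerm[k][w] = weight of term w in topic k (rows need not sum to 1)
     * @param iterations number of update passes
     */
    public static double[] inferTopics(double[] termCounts, double[][] topicTerm, int iterations) {
        int numTopics = topicTerm.length;
        int numTerms = termCounts.length;

        // Normalize each topic row into a distribution p(term | topic).
        double[][] phi = new double[numTopics][numTerms];
        for (int k = 0; k < numTopics; k++) {
            double rowSum = 0.0;
            for (int w = 0; w < numTerms; w++) rowSum += topicTerm[k][w];
            for (int w = 0; w < numTerms; w++) {
                phi[k][w] = rowSum > 0.0 ? topicTerm[k][w] / rowSum : 0.0;
            }
        }

        // Start from a uniform topic mixture.
        double[] theta = new double[numTopics];
        Arrays.fill(theta, 1.0 / numTopics);

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[numTopics];
            double total = 0.0;
            for (int w = 0; w < numTerms; w++) {
                if (termCounts[w] == 0.0) continue;
                // p(w | doc) under the current mixture.
                double norm = 0.0;
                for (int k = 0; k < numTopics; k++) norm += theta[k] * phi[k][w];
                if (norm == 0.0) continue; // term has no mass under any topic
                // Share this term's count among topics by responsibility.
                for (int k = 0; k < numTopics; k++) {
                    double share = termCounts[w] * theta[k] * phi[k][w] / norm;
                    next[k] += share;
                    total += share;
                }
            }
            if (total == 0.0) break; // no usable terms; keep the current mixture
            for (int k = 0; k < numTopics; k++) theta[k] = next[k] / total;
        }
        return theta;
    }

    public static void main(String[] args) {
        double[][] topicTerm = {{8, 1, 1}, {1, 1, 8}}; // toy model: 2 topics, 3 terms
        double[] doc = {5, 1, 0};                      // toy TF vector
        System.out.println(Arrays.toString(inferTopics(doc, topicTerm, 20)));
    }
}
```

Sorting the returned array by value would give the ranked topic matches for the new document.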



Any help or pointers will be greatly appreciated.  Thank you!










