mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Question about LDA output file
Date Mon, 07 Mar 2011 17:54:25 GMT
Our LDA implementation is not completely integrated with the other clustering applications.
In particular, it does not support the clustering (classification) step that you require.
Most of the other clustering apps can do this and produce output similar to the gamma
matrix you noted. That output is not in Mahout DRM format (this could be done) but is a
sequence file [key=topicId; value={WeightedVectorWritable}]. In this output, the weight of
the WVW is the pdf() value for the vector belonging to the topic. For maximum-likelihood
clustering this value is 1, but for Dirichlet and FuzzyK it is a double.
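
To illustrate the output shape described above, here is a minimal Python sketch. It is
not Mahout code: each record is modeled as a (topic_id, weight, doc_id) tuple, which is
an assumption standing in for the key/WeightedVectorWritable pair, and the names
`records` and `to_doc_topic_matrix` are hypothetical. It pivots the records into a
document-by-topic table comparable to a gamma matrix.

```python
from collections import defaultdict


def to_doc_topic_matrix(records, num_topics):
    """Pivot (topic_id, weight, doc_id) records into {doc_id: [weight per topic]}."""
    matrix = defaultdict(lambda: [0.0] * num_topics)
    for topic_id, weight, doc_id in records:
        matrix[doc_id][topic_id] = weight
    return dict(matrix)


# Hypothetical records: a hard (maximum-likelihood) assignment carries weight 1,
# while a soft assignment spreads fractional weights over several topics.
records = [
    (0, 0.7, "doc-a"),
    (1, 0.3, "doc-a"),
    (1, 1.0, "doc-b"),
]
matrix = to_doc_topic_matrix(records, num_topics=2)
```

Topics with no record for a document are left at 0.0 in this sketch.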

If you want to offer a patch for LDA that adds this classification step it would be most appreciated.

-----Original Message-----
From: Bae, Jae Hyeon [mailto:metacret@gmail.com] 
Sent: Sunday, March 06, 2011 10:35 AM
To: user@mahout.apache.org
Subject: Question about LDA output file

Hi

A few days ago, I asked about how to recover document IDs from LDA topic
clusters, but nobody answered. :(

While studying LDA, I found that some implementations of LDA can output a .gamma
file that shows which documents are mainly about specific topics.
I've quoted an explanation of the .gamma file below:

".gamma file: This file includes document × topic matrix. It contains
variational posterior Dirichlets distributions. This file can be used to find
document-based results, such as finding main topics of a document, or finding
the top documents that are most related to a specific topic. In the experiments
of this thesis, .gamma file is used to find out the distribution of documents to
topics."
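
The two uses named in the quote can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: `gamma` is a hypothetical dict mapping
a document ID to its row of Dirichlet parameters (one per topic), the values are
made up, and the function names are mine rather than from any LDA package.
Normalizing a row gives the expected topic proportions for that document.

```python
def topic_proportions(gamma_row):
    """Expected topic proportions: the mean of a Dirichlet with these parameters."""
    total = sum(gamma_row)
    return [g / total for g in gamma_row]


def main_topic(gamma_row):
    """Index of the topic with the largest parameter in this document's row."""
    return max(range(len(gamma_row)), key=lambda k: gamma_row[k])


def top_documents(gamma, topic, n):
    """IDs of the n documents with the highest proportion of the given topic."""
    props = {doc: topic_proportions(row)[topic] for doc, row in gamma.items()}
    return sorted(props, key=props.get, reverse=True)[:n]


# Hypothetical document x topic matrix with two topics.
gamma = {
    "doc-a": [8.0, 2.0],
    "doc-b": [1.0, 9.0],
    "doc-c": [5.0, 5.0],
}
```

For example, `main_topic(gamma["doc-a"])` picks topic 0, and
`top_documents(gamma, 1, 2)` ranks documents by their share of topic 1.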

Unfortunately, the author didn't describe its implementation.

Is there any way to generate output similar to the .gamma file described above?
To explore the relationship between K-means clustering and LDA, I applied LDA
with the number of topics set to 1 to the clustered document set generated by
K-means clustering, and drew a containment diagram between the topical word
list from LDA on the K-means clusters and the topical word list generated by
LDA applied to the whole corpus. But if we could get a .gamma file, it would
make Mahout LDA much stronger.

Best, Jay
