mahout-user mailing list archives

From Gokhan Capan <gkhn...@gmail.com>
Subject LDA Questions
Date Mon, 06 Aug 2012 11:00:24 GMT
Hi,

My question is about interpreting the LDA document-topics output.

I am using trunk.

I have a directory of documents, each of which is named by an integer, and
the data directory has no sub-directories.
The directory structure is as follows:
$ ls /path/to/data/
   1
   2
   5
   ...

From those documents I want to detect topics, and output:
- topic - top terms
- document - top topics

To this end, I first run seqdirectory on the directory:
$ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1

Then I run seq2sparse to create tf vectors of documents:
$ mahout seq2sparse -i $SPARSEDIR_IN_PLACEHOLDER

After creating vectors, I run cvb0_local on those tf-vectors:
$ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs \
    -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0

And to interpret the results, I use mahout's vectordump utility:
$ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 \
    -sort true -p true

$ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words \
    --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile \
    --vectorSize 10 -sort true -p true

The resulting words file consists of #ofTopics lines.
I assume each line is in <topicID \t wordsVector> format, where a
wordsVector is a sorted vector whose elements are <word, score> pairs.

The resulting docs file on the other hand, consists of #ofDocuments lines.
I assume each line is in <documentID \t topicsVector> format, where a
topicsVector is a sorted vector whose elements are <topicID, probability>
pairs.

The problem is that the documentID field does not match the original
document ids. Instead, it is populated with zero-based, auto-incrementing
indices.
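
To make the mismatch concrete, here is a small sketch of the kind of mapping
I seem to be getting. It is only an illustration: index_docs is a
hypothetical helper, not part of Mahout, and the assumption that the indices
follow the numeric order of the file names is mine. I have not confirmed how
the ids are actually assigned.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): read file names from stdin and
# print "<zero-based index><TAB><file name>" for each one.
# ASSUMPTION: indices are handed out in numeric order of the file names.
index_docs() {
  sort -n | awk '{ printf "%d\t%s\n", NR - 1, $0 }'
}

# Example using the three file names from the directory listing above.
printf '%s\n' 5 1 2 | index_docs
```

If that assumption holds, the document named 5 would show up under index 2
in the docs output, which is the kind of mismatch I am seeing.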

I would like to know whether I am missing something that would make
vectordump output the correct document ids, whether this is the normal
behavior when one runs LDA on a directory of documents, or whether I made a
mistake in one of those steps.

I suspect the issue is that seqdirectory assigns Text ids to documents, while
the CVB algorithm expects documents in another format, <IntWritable,
VectorWritable>. If this is the case, could you help me assign IntWritable
ids to documents in the process of creating vectors from them? Or should I
modify the o.a.m.text.SequenceFilesFromDirectory code to do so?

Thanks

-- 
Gokhan
