mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: LDA Questions
Date Mon, 06 Aug 2012 17:19:53 GMT
Hi Gokhan,

  This looks like a bug in the InMemoryCollapsedVariationalBayes0.loadVectors()
method: it takes the SequenceFile<? extends Writable, VectorWritable>, ignores
the keys, and assigns the rows in order into an in-memory Matrix.
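
  To illustrate the symptom, here is a minimal sketch in plain Java. It is not
the Mahout code: the class SequentialLoadSketch, the method loadIgnoringKeys,
and the use of Map.Entry in place of SequenceFile records are all made up for
the illustration. It only shows how document keys "1", "2", "5" end up as rows
0, 1, 2 once the keys are dropped.

```java
import java.util.List;
import java.util.Map;

public class SequentialLoadSketch {

    // Hypothetical stand-in for the reported behaviour: each record's key is
    // read but never used, so row i is simply the i-th vector encountered.
    static double[][] loadIgnoringKeys(List<Map.Entry<String, double[]>> records) {
        double[][] rows = new double[records.size()][];
        int row = 0;
        for (Map.Entry<String, double[]> record : records) {
            rows[row++] = record.getValue(); // record.getKey() is dropped here
        }
        return rows;
    }

    public static void main(String[] args) {
        // Keys mirror the file names from the original question (1, 2, 5).
        List<Map.Entry<String, double[]>> records = List.of(
            Map.entry("1", new double[] {3.0, 0.0}),
            Map.entry("2", new double[] {0.0, 1.0}),
            Map.entry("5", new double[] {2.0, 2.0}));

        double[][] rows = loadIgnoringKeys(records);
        for (int i = 0; i < rows.length; i++) {
            System.out.println("row " + i + " holds the vector for document "
                + records.get(i).getKey());
        }
    }
}
```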

  If you run
"$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path> -o <output path>",
it converts the Text keys into IntWritable keys and leaves behind an index
file, a mapping of Text -> IntWritable that tells you which int is assigned to
which original Text key.
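
  Conceptually, that index works like the sketch below. This is only an
illustration in plain Java, not the rowid implementation: the class
RowIdSketch and the method buildIndex are hypothetical, and storing the
mapping as int -> original key (so it can be used for lookups afterwards) is
my assumption here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowIdSketch {

    // Assigns 0, 1, 2, ... to the text keys in the order they are seen and
    // records which int was given to which original key.
    static Map<Integer, String> buildIndex(List<String> textKeys) {
        Map<Integer, String> index = new LinkedHashMap<>();
        int nextId = 0;
        for (String key : textKeys) {
            index.put(nextId++, key);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> index = buildIndex(List.of("1", "2", "5"));
        // Translate an assigned int id back to the original document name.
        for (Map.Entry<Integer, String> e : index.entrySet()) {
            System.out.println("int id " + e.getKey() + " -> original key " + e.getValue());
        }
    }
}
```

With such an index in hand, a zero-based id in the output can be translated
back to the original document name.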

  Then what you'd want to do is modify
InMemoryCollapsedVariationalBayes0.loadVectors() to actually use the keys that
are given to it, instead of reassigning them to sequential ids. If you make
this change, we'd love to have the diff. Not too many people use the
cvb0_local path (it's usually used for debugging and for testing smaller data
sets to see that topics are converging properly), but getting it to actually
produce document -> topic outputs which correlate with the original docIds
would be very nice! :)
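
  Roughly, the idea is to index rows by the key that comes with each vector.
A minimal sketch in plain Java, under stated assumptions: KeyedLoadSketch and
loadByKey are hypothetical names, Map.Entry stands in for the
<IntWritable, VectorWritable> records, and a sorted map stands in for a sparse
row matrix. The real change would have to fit whatever Matrix type
loadVectors() builds.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyedLoadSketch {

    // Keeps each vector under the int key it arrived with, so gaps in the
    // id sequence (e.g. 1, 2, 5) are preserved instead of being renumbered.
    static Map<Integer, double[]> loadByKey(List<Map.Entry<Integer, double[]>> records) {
        Map<Integer, double[]> rows = new TreeMap<>();
        for (Map.Entry<Integer, double[]> record : records) {
            rows.put(record.getKey(), record.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, double[]>> records = List.of(
            Map.entry(1, new double[] {3.0, 0.0}),
            Map.entry(2, new double[] {0.0, 1.0}),
            Map.entry(5, new double[] {2.0, 2.0}));

        Map<Integer, double[]> rows = loadByKey(records);
        // Any per-document output can now be labelled with the original id.
        for (Integer docId : rows.keySet()) {
            System.out.println("document id " + docId);
        }
    }
}
```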

On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan <gkhncpn@gmail.com> wrote:

> Hi,
>
> My question is about interpreting lda document-topics output.
>
> I am using trunk.
>
> I have a directory of documents, each of which is named by an integer, and
> the data directory has no sub-directories.
> The directory structure is as follows
> $ ls /path/to/data/
>    1
>    2
>    5
>    ...
>
> From those documents I want to detect topics, and output:
> - topic - top terms
> - document - top topics
>
> To this end, I first run seqdirectory on the directory:
> $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
>
> Then I run seq2sparse to create tf vectors of documents:
> $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3
> --namedVector
>
> After creating vectors, I run cvb0_local on those tf-vectors:
> $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to
> $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0
>
> And to interpret the results, I use mahout's vectordump utility:
> $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10
> -sort true -p true
>
> $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary
> $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10
> -sort true -p true
>
> The resulting words file consists of #ofTopics lines.
> I assume each line is in <topicID \t wordsVector> format, where a
> wordsVector is a sorted vector whose elements are <word, score> pairs.
>
> The resulting docs file on the other hand, consists of #ofDocuments lines.
> I assume each line is in <documentID \t topicsVector> format, where a
> topicsVector is a sorted vector whose elements are <topicID, probability>
> pairs.
>
> The problem is that the documentID field does not match the original
> document ids. Instead, this field is populated with zero-based,
> auto-incrementing indices.
>
> I want to ask whether I am missing something needed for vectordump to output
> the correct document ids, whether this is the normal behavior when one runs
> lda on a directory of documents, or whether I made a mistake in one of those
> steps.
>
> I suspect the issue is that seqdirectory assigns Text ids to documents, while
> the CVB algorithm expects documents in another format, <IntWritable,
> VectorWritable>. If this is the case, could you help me assign IntWritable
> ids to documents in the process of creating vectors from them?
> Or should I modify the o.a.m.text.SequenceFilesFromDirectory code to do so?
>
> Thanks
>
> --
> Gokhan
>



-- 

  -jake
