mahout-user mailing list archives

From Mark Wicks <mawi...@gmail.com>
Subject Re: Interpretating doc-topic output of cvb
Date Thu, 20 Jun 2013 17:17:57 GMT
I apologize for posting this again.  I sent it during the weekend and
didn't get any response (which seems unusual for this list :)).
I am hoping that someone with some LDA/cvb experience who can help
might have missed it over the weekend.
Can someone tell me (1) if the document-topic distribution below makes
sense for the term frequencies shown and (2) how I should interpret
it.

Mark Wicks

On Sat, Jun 15, 2013 at 9:22 AM, Mark Wicks <mawicks@gmail.com> wrote:
> I am having trouble interpreting the "doc-topic" distribution produced
> by the cvb implementation of LDA in Mahout 0.7. Here's the
> term-frequency matrix for a simple test case (shown here as the output
> of mahout seqdumper):
>
> Key: /d01: Value: /d01:{0:30.0,1:10.0}
> Key: /d02: Value: /d02:{0:60.0,1:20.0}
> Key: /d03: Value: /d03:{0:30.0,1:10.0}
> Key: /d04: Value: /d04:{0:60.0,1:20.0}
> Key: /x01: Value: /x01:{2:30.0,3:10.0}
> Key: /x02: Value: /x02:{2:60.0,3:20.0}
> Key: /x03: Value: /x03:{2:30.0,3:10.0}
> Count: 7
>
> The intent here was that the d01 through d04 documents would consist
> almost entirely of one topic, represented almost entirely by terms 0
> and 1, with a topic-term distribution of [0.75, 0.25, epsilon, epsilon],
> and that the x01 through x03 documents would consist almost entirely of
> a second topic, represented almost entirely by terms 2 and 3, with a
> topic-term distribution of [epsilon, epsilon, 0.75, 0.25]. Since the "d"
> documents do not contain terms 2 or 3 and the "x" documents do not
> contain terms 0 or 1, I expected to see document-topic distributions
> approximately equal to:
>
> d01: 1 0
> d02: 1 0
> d03: 1 0
> d04: 1 0
> x01: 0 1
> x02: 0 1
> x03: 0 1
>
> I ran the following command (where the simplelda/sparse/matrix directory
> contained the previous term frequency matrix). The algorithm ran to completion
> (meaning that it converged before the maximum number of iterations was
> exceeded).
>
> mahout cvb \
>    -i simplelda/sparse/matrix \
>    -dict simplelda/sparse/dictionary.file-0 \
>    -ow -o simplelda/cvb-topics \
>    -dt simplelda/cvb-classifications \
>    -tf 0.25 \
>    -block 4 \
>    -x 20 \
>    -cd 1e-10 \
>    -k 2 \
>    --tempDir simplelda/temp-k2 \
>    -seed 6956
>
> The topic-term frequencies written to simplelda/cvb-topics were accurate and as
> expected:
>
> {0:0.7499999999895863,1:0.2499999999548601,2:2.7776873636508568E-11,3:2.777682733874987E-11}
> {0:9.375466996550278E-11,1:9.375456577819702E-11,2:0.7499999998802006,3:0.24999999993229008}
>
> However, the document-topic distribution output written to
> simplelda/cvb-classifications was not at all what I expected:
>
> Key: 0: Value: {0:0.05705773500297721,1:0.9429422649970228}
> Key: 1: Value: {0:0.05705773500297721,1:0.9429422649970228}
> Key: 2: Value: {0:0.05705773500297721,1:0.9429422649970228}
> Key: 3: Value: {0:0.05705773500297721,1:0.9429422649970228}
> Key: 4: Value: {0:0.4335650246424872,1:0.5664349753575127}
> Key: 5: Value: {0:0.4335650246424872,1:0.5664349753575127}
> Key: 6: Value: {0:0.4335650246424872,1:0.5664349753575127}
> Count: 7
>
> These are called "doc-topic distributions" in the help output, so I
> interpreted this to mean that the estimator concluded the "d" document
> terms were most likely all drawn from the second topic. But the "d"
> documents contain no terms from the second topic! Likewise, the "x"
> documents contain no terms from the first topic, so why is there a
> relatively large value (0.4335) in the first column? If this
> document-topic distribution produced by cvb is correct, what does it
> represent?
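
The expectation in the quoted message can be checked by hand with a few lines of Python. This is only a rough sketch, not a reproduction of what cvb does internally: it assumes each document is generated entirely by a single topic with a uniform prior over the two topics, which is much simpler than LDA inference. The term counts are copied from the seqdumper output, the topic-term values are rounded from the cvb-topics output, and the function names are made up for illustration.

```python
import math

# Term counts copied from the seqdumper output (term id -> count).
docs = {
    "/d01": {0: 30.0, 1: 10.0},
    "/d02": {0: 60.0, 1: 20.0},
    "/d03": {0: 30.0, 1: 10.0},
    "/d04": {0: 60.0, 1: 20.0},
    "/x01": {2: 30.0, 3: 10.0},
    "/x02": {2: 60.0, 3: 20.0},
    "/x03": {2: 30.0, 3: 10.0},
}

# Topic-term distributions, rounded from the simplelda/cvb-topics output.
topics = [
    [0.75, 0.25, 2.78e-11, 2.78e-11],
    [9.38e-11, 9.38e-11, 0.75, 0.25],
]


def relative_frequencies(counts, num_terms=4):
    """Normalize one document's term counts so they sum to 1."""
    total = sum(counts.values())
    return [counts.get(t, 0.0) / total for t in range(num_terms)]


def topic_posterior(counts, topics):
    """Posterior over topics for one document.

    Assumption (not cvb's algorithm): the whole document is drawn from
    exactly one topic, and both topics are equally likely a priori.
    """
    log_liks = [
        sum(c * math.log(topic[t]) for t, c in counts.items())
        for topic in topics
    ]
    # Subtract the maximum before exponentiating to avoid underflow issues.
    m = max(log_liks)
    weights = [math.exp(ll - m) for ll in log_liks]
    z = sum(weights)
    return [w / z for w in weights]


for name, counts in docs.items():
    print(name, relative_frequencies(counts), topic_posterior(counts, topics))
```

Under that simplified assumption, the "d" documents put essentially all of their weight on the first topic and the "x" documents on the second, which matches the expected table in the quoted message rather than the doc-topic output that cvb wrote.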
