mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Date Thu, 31 Jan 2013 14:33:14 GMT
Hi Yutaka,


On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20525entradero@gmail.com> wrote:

> Hi
> Here is a question around how to evaluate the result of Mahout 0.7 CVB
> (Collapsed Variational Bayes), which used to be LDA
> (Latent Dirichlet Allocation) in Mahout version under 0.5.
> I believe I have no problem running CVB itself; this is purely a
> question about the most efficient way to visualize or evaluate the result.

> It looks like result evaluation in Mahout 0.5 could at least be done using
> the utility called "LDAPrintTopics"; however, this has been
> obsolete since Mahout 0.5. (See "Mahout in Action", p. 181, on LDA.)
>
> As said, I'm using Mahout 0.7. I believe I'm running CVB
> successfully and have obtained results in two separate directories:
> /user/hadoop/temp/topicModelState/model-1 through model-20, matching the
> specified number of iterations, and
> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009, matching
> the number of topics that I wanted to extract.
>
> The files in either directory can be dumped using Mahout
> vectordump; however, the output format is quite different
> from what you would have gotten with LDAPrintTopics in versions below 0.5,
> which gave you back each topic ID and its
> associated top terms in a very direct format. (See "Mahout in Action", p. 181,
> again.)
>

Vectordump should be exactly what you want, actually.


>
> Here is what I've done:
> 1. Say I have already generated document vectors; I use the tf-vectors to
> generate a document/term matrix:
>
> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> NHTSA-matrix03
>
> 2. Then move the matrix docIndex out of the way, as it would otherwise
> interfere (as advised somewhere…):
> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> NHTSA-matrix03-docIndex
>
> 3. Confirm that only the matrix itself remains:
> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> Found 1 items
> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> /user/hadoop/NHTSA-matrix03/matrix
>
> 4. Kick off CVB:
> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> …
> ….
> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> (Minutes: 733.1281333333334)
> (It took over 12 hours to process 100k documents on my laptop with
> pseudo-distributed Hadoop 0.20.203.)
>
> 5. Take a look at what I've got.
> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> Found 12 items
> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/_logs
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> [hadoop@localhost NHTSA]$
>

OK, those should be your model files. To view them, you
can do it the way you would view any
SequenceFile<IntWritable, VectorWritable>, like this:

$MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
  -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt \
  --dictionaryType sequencefile --vectorSize 5 --sort

This will dump the top 5 terms (with weights - I'm not sure whether they'll
be properly normalized) from each topic to the output file "topic_dump.txt".
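For intuition, the sorting that --sort and --vectorSize apply amounts to taking the top-N dictionary terms by weight for each topic vector. A plain-Python sketch of that idea (the dictionary and weights below are made-up illustrative data, not Mahout output):

```python
# Sketch: pick the top-N terms of a topic by weight, the same idea
# vectordump's --sort / --vectorSize options apply per topic vector.

def top_terms(topic_weights, dictionary, n=5):
    """Return the n (term, weight) pairs with the largest weights."""
    pairs = [(dictionary[i], w) for i, w in enumerate(topic_weights)]
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs[:n]

# Fabricated toy dictionary and one topic's term weights:
dictionary = ["brake", "engine", "airbag", "door", "seat", "light"]
weights    = [0.30,    0.05,     0.25,     0.10,   0.20,   0.10]

print(top_terms(weights, dictionary, n=3))
# [('brake', 0.3), ('airbag', 0.25), ('seat', 0.2)]
```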

Incidentally, this same command can be run on the topicModelState
directories as well, which lets you see how fast your topic model was
converging (and thus, on a smaller data set, how many iterations you
may want to run later on).
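One rough way to read convergence off those per-iteration dumps is to compare each iteration's topic-term weights against the previous iteration's; once the change levels off, extra iterations buy little. A toy sketch of that comparison (the snapshot numbers are fabricated, not real model-1 … model-20 output):

```python
# Sketch: measure how much topic-term weights move between successive
# model iterations, as a crude convergence signal.

def total_change(prev, curr):
    """Sum of absolute weight changes between two iteration snapshots."""
    return sum(abs(a - b) for a, b in zip(prev, curr))

# Fabricated per-iteration weight snapshots for a two-term topic:
snapshots = [
    [0.50, 0.50],   # iteration 1
    [0.70, 0.30],   # iteration 2
    [0.75, 0.25],   # iteration 3
    [0.76, 0.24],   # iteration 4: change has nearly flattened out
]

for i in range(1, len(snapshots)):
    delta = total_change(snapshots[i - 1], snapshots[i])
    print(f"iteration {i} -> {i + 1}: change = {delta:.2f}")
```

When the printed change approaches zero, the model has effectively stopped moving and further iterations are unlikely to help.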


>
> and
> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> Found 20 items
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
> /user/hadoop/temp/topicModelState/model-1
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
> /user/hadoop/temp/topicModelState/model-10
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
> /user/hadoop/temp/topicModelState/model-11
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
> /user/hadoop/temp/topicModelState/model-12
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
> /user/hadoop/temp/topicModelState/model-13
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
> /user/hadoop/temp/topicModelState/model-14
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
> /user/hadoop/temp/topicModelState/model-15
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
> /user/hadoop/temp/topicModelState/model-16
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
> /user/hadoop/temp/topicModelState/model-17
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
> /user/hadoop/temp/topicModelState/model-18
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
> /user/hadoop/temp/topicModelState/model-19
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
> /user/hadoop/temp/topicModelState/model-2
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> /user/hadoop/temp/topicModelState/model-20
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
> /user/hadoop/temp/topicModelState/model-3
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
> /user/hadoop/temp/topicModelState/model-4
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
> /user/hadoop/temp/topicModelState/model-5
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
> /user/hadoop/temp/topicModelState/model-6
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
> /user/hadoop/temp/topicModelState/model-7
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
> /user/hadoop/temp/topicModelState/model-8
> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
> /user/hadoop/temp/topicModelState/model-9
>
> I hope someone can help with this.
> Regards,
> Yutaka
>



-- 

  -jake
