mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Interpretating doc-topic output of cvb
Date Tue, 25 Jun 2013 22:45:01 GMT
I'm glad it's now making sense, and I'm sorry it was so hard to get to this
point!  If you were to write up some notes on how to get this to work, aimed
at yourself of two weeks ago, we could post them on the wiki and save an
alternate-universe version of you some trouble! :)
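
As a possible starting point for such notes, here is a minimal sketch in plain
Python (no Hadoop or Mahout needed) of why the keys matter. It uses the rounded
numbers from Mark's test below. The key order [1, 0] is an assumption inferred
from those numbers, since the thread never shows the actual keys, and the names
csv_rows, keys, topic_term, doc_topic and expected_terms are made up for this
illustration.

```python
# Topic/term rows in the order the CSV export printed them (values rounded).
csv_rows = [
    [0.0, 0.0, 0.5, 0.5],
    [0.75, 0.25, 0.0, 0.0],
]

# ASSUMPTION: the IntWritable topic ids attached to those two rows.
# Inferred from the data in the thread, not shown anywhere in it.
keys = [1, 0]

# Index the rows by topic id instead of by their position in the file.
topic_term = dict(zip(keys, csv_rows))

# Document/topic inferences and raw term counts for one document of each kind.
doc_topic = {"/d01": [1.0, 0.0], "/x01": [0.0, 1.0]}
term_counts = {"/d01": [30, 10, 0, 0], "/x01": [0, 0, 30, 30]}


def normalize(counts):
    """Turn raw term counts into a term distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def expected_terms(doc):
    """Mix the topic/term rows using the document's topic weights."""
    weights = doc_topic[doc]
    n_terms = len(csv_rows[0])
    return [
        sum(weights[t] * topic_term[t][j] for t in range(len(weights)))
        for j in range(n_terms)
    ]


for doc in doc_topic:
    print(doc, expected_terms(doc), normalize(term_counts[doc]))
```

With the rows indexed by key, the mixed distribution for each document matches
its normalized term counts. Indexing the same rows by file position instead
gives the apparent permutation described below.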


On Tue, Jun 25, 2013 at 3:05 PM, Mark Wicks <mawicks@gmail.com> wrote:

> Andy,
>
> Thanks!  Now I see what's going on.  The keys are indeed there and do
> establish the correct order.  I had been using the csv export option, which
> dropped the keys.
>
> My apologies for any confusion.  It's all good...
>
> Mark
>
>
>
> On Tue, Jun 25, 2013 at 5:51 PM, Andy Schlaikjer <
> andrew.schlaikjer@gmail.com> wrote:
>
>> Mark, I'm confused: the topic-term distributions are also sequencefile data
>> where the keys are IntWritables encoding the topic id. Can you share those as
>> well?
>>
>>
>> On Tue, Jun 25, 2013 at 2:35 PM, Mark Wicks <mawicks@gmail.com> wrote:
>>
>> > It looks like I may have spoken too soon.  With cvb from the trunk, the
>> > *values* of the document/topic inferences are now correct, but the topics
>> > appear to be ordered differently in the topic/term distribution matrix and
>> > the document/topic inference matrix.  Because of this permutation, you
>> > can't tell which topics go with which documents:
>> >
>> > I tested using the following term-frequency matrix:
>> >
>> > Key: 0: Value: /d01:{0:30.0,1:10.0,}
>> > Key: 1: Value: /d02:{0:60.0,1:20.0,}
>> > Key: 2: Value: /d03:{0:30.0,1:10.0,}
>> > Key: 3: Value: /d04:{0:60.0,1:20.0,}
>> > Key: 4: Value: /x01:{2:30.0,3:30.0,}
>> > Key: 5: Value: /x02:{2:60.0,3:60.0,}
>> > Key: 6: Value: /x03:{2:30.0,3:30.0,}
>> >
>> > cvb produced the following topic/term distributions
>> >
>> > 4.166667614017639E-11,4.166662983931827E-11,0.4999999999583334,0.4999999999583334
>> > 0.7499999999583331,0.24999999995833333,4.1666664564961835E-11,4.1666664564961835E-11
>> >
>> > and the following document/topic inferences
>> > 0.9999999999166667,8.333330597949465E-11
>> > 0.9999999999166667,8.333330597949465E-11
>> > 0.9999999999166667,8.333330597949465E-11
>> > 0.9999999999166667,8.333330597949465E-11
>> > 8.33333291299237E-11,0.9999999999166668
>> > 8.33333291299237E-11,0.9999999999166668
>> > 8.33333291299237E-11,0.9999999999166668
>> >
>> > Four documents have the 3:1:0:0 term distribution and three have the
>> > 0:0:1:1 term distribution, so either the columns of the document/topic
>> > inferences are reversed or the rows of the topic/term distributions are.
>> >
>> > Mark
>> >
>> >
>> >
>> > On Tue, Jun 25, 2013 at 1:23 PM, Mark Wicks <mawicks@gmail.com> wrote:
>> >
>> > > Sebastian,
>> > >
>> > > Yes, cvb works well after applying that patch (and the document/topic
>> > > inferences make sense now).
>> > >
>> > > Thanks!
>> > > Mark
>> > >
>> > >
>> > > On Tue, Jun 25, 2013 at 12:23 AM, Sebastian Schelter <ssc@apache.org>
>> > > wrote:
>> > > > Hi Mark,
>> > > >
>> > > > I think I broke this code when I cleaned up LDA recently. Can you see
>> > > > whether everything works after applying the patch attached to
>> > > > https://issues.apache.org/jira/browse/MAHOUT-1268 ?
>> > > >
>> > > > Thanks,
>> > > > Sebastian
>> > > >
>> > > > On 24.06.2013 18:57, Mark Wicks wrote:
>> > > >> Thanks for the response.
>> > > >>
>> > > >> The command line I used is
>> > > >>
>> > > >> mahout cvb -ow -dict sparse/dictionary.file-0 -i matrix/matrix -o cvb/topics -dt cvb/classifications -block 2 -x 2 -cd 1e-10 -k2 -seed 6956 -tf 0.25
>> > > >>
>> > > >> This completes with no errors in Mahout 0.7.  With Mahout/cvb from
>> > > >> trunk I get:
>> > > >>
>> > > >> 13/06/24 12:48:32 INFO cvb.CVB0Driver: About to run: Writing final topic/term distributions from temp/topicModelState/model-2 to cvb/topics
>> > > >> 13/06/24 12:48:32 INFO input.FileInputFormat: Total input paths to process : 10
>> > > >> 13/06/24 12:48:33 INFO cvb.CVB0Driver: About to run: Writing final document/topic inference from matrix/matrix to cvb/classifications
>> > > >> 13/06/24 12:48:33 INFO mapred.JobClient: Cleaning up the staging area hdfs://192.168.84.8:9000/tmp/hadoop-hadoop/mapred/staging/mwicks/.staging/job_201304292057_0252
>> > > >> 13/06/24 12:48:33 ERROR security.UserGroupInformation: PriviledgedActionException as:mwicks cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory cvb/topics already exists
>> > > >> Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory cvb/topics already exists
>> > > >>         at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
>> > > >>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:949)
>> > > >>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
>> > > >>         at java.security.AccessController.doPrivileged(Native Method)
>> > > >>         at javax.security.auth.Subject.doAs(Subject.java:415)
>> > > >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>> > > >>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
>> > > >>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
>> > > >>         at org.apache.mahout.clustering.lda.cvb.CVB0Driver.writeDocTopicInference(CVB0Driver.java:463)
>> > > >>         at org.apache.mahout.clustering.lda.cvb.CVB0Driver.run(CVB0Driver.java:339)
>> > > >>         at org.apache.mahout.clustering.lda.cvb.CVB0Driver.run(CVB0Driver.java:198)
>> > > >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> > > >>         at org.apache.mahout.clustering.lda.cvb.CVB0Driver.main(CVB0Driver.java:534)
>> > > >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > > >>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > > >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > > >>         at java.lang.reflect.Method.invoke(Method.java:601)
>> > > >>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> > > >>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> > > >>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> > > >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > > >>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > > >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > > >>         at java.lang.reflect.Method.invoke(Method.java:601)
>> > > >>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>> > > >>
>> > > >>
>> > > >> I am certain that "cvb/topics" did not exist before running "mahout
>> > > >> cvb".  After the error, cvb/topics exists and contains data, but
>> > > >> cvb/classifications does not exist.
>> > > >>
>> > > >> Mark
>> > > >>
>> > > >
>> > >
>> >
>>
>
>


-- 

  -jake
