mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Dirichlet ClusterDump Output
Date Wed, 05 May 2010 03:19:21 GMT
Hi Delroy,

You did not say whether you are using 0.3 or trunk; I suggest trunk, since 
Dirichlet has recently been better integrated there. Looking at your code 
fragment and comparing it with what the ClusterDumper is (now) doing:

       SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
       Writable key = (Writable) reader.getKeyClass().newInstance();
       Writable value = (Writable) reader.getValueClass().newInstance();
       while (reader.next(key, value)) {
         Cluster cluster = (Cluster) value;
         String fmtStr = useJSON ? cluster.asJsonString()
                                 : cluster.asFormatString(dictionary);

... it kinda looks like you are not actually reading in the clusters and 
their models, but rather just creating a new instance of DirichletCluster 
(the value class). That approach will not read in the model or any of 
the cluster state, hence your observations. You should be able to just 
run the ClusterDumper by pointing it at your cluster directory, as in 
TestClusterDumper.testDirichlet.
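
To illustrate why a freshly instantiated value class carries no state, here 
is a minimal sketch using only java.io. ToyCluster and EmptyInstanceDemo are 
hypothetical stand-ins that imitate the write/readFields pattern of a 
Writable; they are not Mahout or Hadoop classes, and the field names are 
assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable cluster; not a Mahout class.
class ToyCluster {
  String model;       // stays null until readFields() runs
  double totalCount;

  void write(DataOutput out) throws IOException {
    out.writeUTF(model);
    out.writeDouble(totalCount);
  }

  void readFields(DataInput in) throws IOException {
    model = in.readUTF();
    totalCount = in.readDouble();
  }
}

class EmptyInstanceDemo {
  // Serializes one cluster to bytes, standing in for one stored record.
  static byte[] serialize(String model, double totalCount) {
    try {
      ToyCluster c = new ToyCluster();
      c.model = model;
      c.totalCount = totalCount;
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      c.write(new DataOutputStream(bytes));
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Fills a fresh instance from exactly the bytes of one record.
  static ToyCluster deserialize(byte[] record) {
    try {
      ToyCluster c = new ToyCluster();
      c.readFields(new DataInputStream(new ByteArrayInputStream(record)));
      return c;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // A new instance alone has no state, much like newInstance() on the value class.
    ToyCluster empty = new ToyCluster();
    System.out.println(empty.model); // null

    // State appears only after the record's bytes have been read into the object.
    ToyCluster filled = deserialize(serialize("NormalModel", 42.0));
    System.out.println(filled.model);
  }
}
```

In the toy version the object is filled only when readFields is handed the 
bytes of one serialized record; in the real case that step happens inside 
reader.next(key, value).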

If you really want to write your own code for reading the clusters, I 
suggest copying the above and remembering to create a new value object 
on each pass through your loop. Otherwise the reader will reuse the first 
instance and all of your saved clusters will end up identical. Something 
like this:

       while (reader.next(key, value)) {
         DirichletCluster cluster = (DirichletCluster) value;
         String fmtStr = useJSON ? cluster.asJsonString()
                                 : cluster.asFormatString(dictionary);
         // <save the cluster in some data structure>
         // allocate a fresh value so the next read does not overwrite this one
         value = (Writable) reader.getValueClass().newInstance();
       }
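
The reuse problem can be reproduced without Hadoop. The sketch below is only 
an illustration: ToyValue and ToyReader are hypothetical stand-ins for a 
Writable value and for SequenceFile.Reader, and ToyReader.next fills the 
caller's object in place in the same way as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins; these are not Hadoop or Mahout classes.
class ToyValue {
  String label;  // state the reader overwrites on every next() call
}

class ToyReader {
  private final Iterator<String> records;

  ToyReader(List<String> records) {
    this.records = records.iterator();
  }

  // Like reader.next(key, value): fills the caller's object in place.
  boolean next(ToyValue value) {
    if (!records.hasNext()) {
      return false;
    }
    value.label = records.next();
    return true;
  }
}

class ValueReuseDemo {
  // Saves a reference to the same object on every pass.
  static List<String> collectReusing(List<String> records) {
    ToyReader reader = new ToyReader(records);
    List<ToyValue> saved = new ArrayList<>();
    ToyValue value = new ToyValue();
    while (reader.next(value)) {
      saved.add(value);
    }
    return labels(saved);
  }

  // Allocates a new value after saving the current one.
  static List<String> collectFresh(List<String> records) {
    ToyReader reader = new ToyReader(records);
    List<ToyValue> saved = new ArrayList<>();
    ToyValue value = new ToyValue();
    while (reader.next(value)) {
      saved.add(value);
      value = new ToyValue();
    }
    return labels(saved);
  }

  private static List<String> labels(List<ToyValue> values) {
    List<String> out = new ArrayList<>();
    for (ToyValue v : values) {
      out.add(v.label);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> records = Arrays.asList("c0", "c1", "c2");
    System.out.println(collectReusing(records)); // [c2, c2, c2]
    System.out.println(collectFresh(records));   // [c0, c1, c2]
  }
}
```

With the reused object every saved entry shows the last record read; with a 
fresh object per pass each entry keeps its own state.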

Let me know how it goes,
Jeff

On 5/4/10 5:54 PM, Delroy Cameron wrote:
> so i've run Dirichlet Clustering using Mahout and i'm trying to see the
> clusterdump. Of course i'm using a combination of ClusterDumper,
> DirichletOutputState and DirichletCluster and TestL1ModelClustering to help
> with the output.
>
> so far i've successfully read each file in each state-x output folder. The
> issue is that the vectors appear to be serialized as <Text,
> DirichletCluster> pairs in each binary dump, which is fine. However, after
> debugging it turns out that the model for each DirichletCluster is
> null....and this makes sense, since i'm reading from the dump file as
> follows:
>
> SequenceFile.Reader  reader = new SequenceFile.Reader(fileSystem, inputPath,
> conf);
> Text key = (Text) reader.getKeyClass().newInstance();
> DirichletCluster cluster = (DirichletCluster)
> reader.getValueClass().newInstance();
>
> i tried to set the fields for the DirichletCluster by using the following
> method readFields(DataInput in);
> DataInput istream = new DataInputStream(new FileInputStream(new
> File(fileName)));
> cluster.readFields(istream);
>
> and i have a null pointer exception...
>
> can i have a few suggestions on how to proceed here...
>
> -----
> --cheers
> Delroy
>    

