mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Using SVD with Canopy/KMeans
Date Sat, 11 Sep 2010 23:55:31 GMT

On Sep 11, 2010, at 5:50 PM, Ted Dunning wrote:

> Should be close.  The matrixMult step may be redundant if you want to
> cluster the same data that you decomposed.  That would make the second
> transpose unnecessary as well.

Hmm, I thought I was just translating what Jeff had done below, specifically:

>>> DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>   sData.configure(conf);
>>> 
>>>   // now run the Canopy job to prime kMeans canopies
>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
>>>   // now run the KMeans job
>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
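As a sanity check on the shapes in that projection line (a sketch, assuming the transposeTimes semantics Jeff describes further down, i.e. this.times(B) actually computes this' * B):

  // a    : n x m        (n = sampleData.size(), m = sampleDimension)
  // svdT : (rank-1) x m (Lanczos eigenvectors as rows)
  // a.transpose()    : m x n
  // svdT.transpose() : m x (rank-1)
  // with times() computing this' * B:
  //   a.transpose().times(svdT.transpose()) == a * svdT'  : n x (rank-1)
  // i.e. sData holds one row per input vector, projected onto the top
  // rank-1 singular vectors -- which is what Canopy/KMeans cluster next.

The same arithmetic makes the bin/mahout translation below consistent: matrixmult originalT svdT yields an n x rank matrix of projected rows, assuming matrixmult wraps the same times() call.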



> 
> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> 
>> To put this in bin/mahout speak, this would look like the following, munging
>> some names and taking liberties with the actual arguments to be passed in:
>> 
>> bin/mahout svd (original -> svdOut)
>> bin/mahout cleansvd ...
>> bin/mahout transpose svdOut -> svdT
>> bin/mahout transpose original -> originalT
>> bin/mahout matrixmult originalT svdT -> newMatrix
>> bin/mahout kmeans newMatrix
>> 
>> Is that about right?
>> 
>> 
>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>> 
>>> Ok, the transposed computation seems to work and the cast exception was
>>> caused by my unit test writing LongWritable keys to the testdata file. The
>>> following test produces a comparable answer to the non-distributed case. I
>>> still want to rename the method to transposeTimes for clarity. And better,
>>> implement timesTranspose to make this particular computation more efficient:
>>> 
>>> public void testKmeansDSVD() throws Exception {
>>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>   Path output = getTestTempDirPath("output");
>>>   Path tmp = getTestTempDirPath("tmp");
>>>   Path eigenvectors = new Path(output, "eigenvectors");
>>>   int desiredRank = 13;
>>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>   Configuration config = new Configuration();
>>>   solver.setConf(config);
>>>   Path testData = getTestTempDirPath("testdata");
>>>   int sampleDimension = sampleData.get(0).get().size();
>>>   solver.run(testData, tmp, eigenvectors, sampleData.size(), sampleDimension, false, desiredRank);
>>> 
>>>   // now multiply the testdata matrix and the eigenvector matrix
>>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, tmp, desiredRank - 1, sampleDimension);
>>>   JobConf conf = new JobConf(config);
>>>   svdT.configure(conf);
>>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, sampleData.size(), sampleDimension);
>>>   a.configure(conf);
>>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>   sData.configure(conf);
>>> 
>>>   // now run the Canopy job to prime kMeans canopies
>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
>>>   // now run the KMeans job
>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>   // run ClusterDumper
>>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"), new Path(output, "clusteredPoints"));
>>>   clusterDumper.printClusters(termDictionary);
>>> }
>>> 
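The projection in the test above leans on the contract Jeff identifies in the earlier message quoted next: DistributedRowMatrix.times(B) agrees with A' * B, not A * B'. A minimal in-memory analogue of that contract, with toy values purely for illustration (assumes org.apache.mahout.math.Matrix and DenseMatrix):

  Matrix inputA = new DenseMatrix(new double[][] {{1, 2}, {3, 4}, {5, 6}}); // 3 x 2
  Matrix inputB = new DenseMatrix(new double[][] {{1, 0}, {0, 1}, {1, 1}}); // 3 x 2
  // the distributed result is expected to match A' * B, a 2 x 2 matrix:
  Matrix expected = inputA.transpose().times(inputB);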
>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>> Looking at the single unit test of DRM.times() it seems to be
>>>> implementing Matrix expected = inputA.transpose().times(inputB), and not
>>>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>>>> implemented. But the method still has the wrong name and AFAICT is not
>>>> useful for performing this particular computation. Should I use this
>>>> instead?
>>>> 
>>>> DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
>>>> 
>>>> ugh! And it still fails with:
>>>> 
>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
>>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
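The cast traces back to the fix Jeff reports at the top of his later message: the testdata SequenceFile must be keyed with IntWritable row indices, since TransposeJob's mapper casts its keys to IntWritable. A hypothetical sketch of writing the file correctly (fs, conf, testData, and sampleData stand in for whatever the surrounding test provides):

  // old-style SequenceFile.Writer, matching the mapred-era APIs above
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, testData,
      IntWritable.class, VectorWritable.class);
  try {
    int row = 0;
    for (VectorWritable vw : sampleData) {
      writer.append(new IntWritable(row++), vw); // IntWritable keys, not LongWritable
    }
  } finally {
    writer.close();
  }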

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

