mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Using SVD with Canopy/KMeans
Date Thu, 02 Sep 2010 17:50:01 GMT
On Thu, Sep 2, 2010 at 10:17 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

>  I'm not clear on what you are trying to do and why. The clustering
> applications all expect an input directory full of
> sequence<*,VectorWritable> files. It looks like such files are produced by
> DistributedLanczosSolver in the outputEigenVectorPath, but what would
> clustering them give you? Alternatively, if you intend to reconstruct a
> reduced-dimensionality dataset from the eigenvectors, then that might be
> useful, but you would need to add a step to reconstruct that dataset before
> feeding it to the clustering.


Derek,

  The step Jeff's referring to is this: the SVD output is a set of vectors
in the "column space" of your original matrix (each eigenvector has the same
dimension as one of your input rows).  If you want to cluster your original
data projected onto this new SVD basis, you need to matrix-multiply your
original data by the (transposed) eigenvector matrix.  Depending on how big
your data is (number of rows and columns, and rank of the reduction), you
can do this in either one or two map-reduce passes.
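
  Concretely: if A is your n x m input matrix and E is the k x m matrix
whose rows are the k eigenvectors Lanczos writes out (each of dimension m,
the number of columns of A), then the projection is A * E^t, an n x k
matrix whose rows are what you cluster.  A rough sketch with
DistributedRowMatrix -- treat it as illustrative only, since the
constructor arguments differ across versions, and the distributed multiply
is implemented as a row-wise join, so I'm assuming times() computes
this^t * other rather than this * other:

  // paths and dimensions are placeholders; wire in your Hadoop config
  // via setConf()/configure(), depending on your Mahout version
  DistributedRowMatrix a =
      new DistributedRowMatrix(inputPath, tmpPath, numRows, numCols);
  DistributedRowMatrix e =
      new DistributedRowMatrix(eigenVectorPath, tmpPath2, rank, numCols);
  // if times() computes this^t * other, then (A^t)^t * E^t == A * E^t
  DistributedRowMatrix aProj = a.transpose().times(e.transpose());

The rows of aProj (the sequence<IntWritable,VectorWritable> files under its
row path) are then the input for the clustering drivers.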

  If you need more detail, I can spell that out a little more directly.  It
should actually be not just explained in words, but coded into the examples,
now that I think of it... need. more. hours. in. day....
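
  In outline, though, your test would become something like this, where
<projectedPath> is just a placeholder for wherever you write the multiply
output:

  // 1) SVD on the raw data; eigenvectors land in <outputEigenVectorPath>
  // 2) projection: compute A * E^t as above, rows go to <projectedPath>
  // 3) cluster the projected rows instead of "testdata":
  CanopyDriver.runJob(<projectedPath>, output, measure, 8, 4, false, false);
  KMeansDriver.runJob(<projectedPath>, new Path(output, "clusters-0"),
      output, measure, 0.001, 10, 1, true, false);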

  -jake


>
> On 9/2/10 9:18 AM, Derek O'Callaghan wrote:
>
>> Hi,
>>
>> I've recently started to try out Mahout for clustering. To familiarise
>> myself with the code, I've created a copy of TestClusterDumper and have
>> modified getSampleData() to load my own test data. So far, I've been trying
>> out Canopy and KMeans (testCanopy() and testKmeans()), and I'd like to
>> extend this to perform SVD on the data before executing these.
>>
>> I assume this is just a case of adding a call to
>> DistributedLanczosSolver.run() beforehand, and then running Canopy/KMeans
>> afterwards. However, I'm a little unsure about how the input/output paths are used.
>> E.g., TestClusterDumper.testKmeans() currently looks like:
>>
>> .
>> .
>> .
>> // now run the Canopy job to prime kMeans canopies
>> Path output = getTestTempDirPath("output");
>> CanopyDriver.runJob(getTestTempDirPath("testdata"), output, measure, 8, 4,
>> false, false);
>> // now run the KMeans job
>> KMeansDriver.runJob(getTestTempDirPath("testdata"), new Path(output,
>> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>> .
>> .
>> .
>>
>> In order to run the Lanczos solver, I assume I can do something like:
>>
>> .
>> .
>> .
>> Path output = getTestTempDirPath("output");
>>
>> // Run SVD first
>> DistributedLanczosSolver.run(getTestTempDirPath("testdata"),
>> <outputTmpPath>, <outputEigenVectorPath>,.....);
>>
>> // now run the Canopy job to prime kMeans canopies
>> CanopyDriver.runJob(<input data path> , output, measure, 8, 4, false,
>> false);
>> // now run the KMeans job
>> KMeansDriver.runJob(<input data path>, new Path(output, "clusters-0"),
>> output, measure, 0.001, 10, 1, true, false);
>> .
>> .
>> .
>>
>> My question is, from the example above, what path value from the Lanczos
>> call should I use for the <input data path> to Canopy/KMeansDriver.runJob()?
>> Or, am I going about this the wrong way?
>>
>> Thanks,
>>
>> Derek
>>
>
