mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Transforming data for k-means analysis
Date Mon, 06 Sep 2010 15:12:43 GMT
  Hi Radek,

I think you are on the right track building off of the synthetic control
example. It has an initial pre-processing step (canopy.InputDriver) that
converts space-delimited text files into Mahout VectorWritable sequence
files suitable as input to Canopy and k-Means. It could be as simple as
changing your delimiter from tab to space, or you might need to write your
own pre-processor; a rough sketch follows.
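
If you do end up writing your own pre-processor, something along these
lines should do it. This is an untested sketch: the output path and the
Text row keys are just placeholder choices, and you would swap in whatever
delimiter your file actually uses.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Reads delimited rows of doubles (one vector per line) and writes them
 * out as <Text, VectorWritable> sequence file records for Canopy/k-Means.
 */
public class TsvToVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // "testdata" is just the example input directory the jobs look in
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        new Path("testdata/part-00000"), Text.class, VectorWritable.class);
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    String line;
    int row = 0;
    while ((line = reader.readLine()) != null) {
      String[] tokens = line.split("\t"); // your delimiter here
      Vector vector = new RandomAccessSparseVector(tokens.length);
      for (int i = 0; i < tokens.length; i++) {
        double value = Double.parseDouble(tokens[i]);
        if (value != 0.0) { // only store non-zeros; the rows are sparse
          vector.set(i, value);
        }
      }
      writer.append(new Text(String.valueOf(row++)),
          new VectorWritable(vector));
    }
    reader.close();
    writer.close();
  }
}

Then point Canopy or k-Means at the output directory rather than at the
raw text file.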
The kmeans.Job driver runs this conversion and then fires off Canopy to
produce the initial clusters. You will need to experiment with the T1 and
T2 values in this step to get the number of clusters you want (~20).

You can skip the Canopy step entirely if you already know the value of k
you want: simply add the -k argument to the mahout kmeans command and run
it from the command line. That will randomly sample your dataset to
determine the initial cluster centers. (Sorry, the KMeansDriver public
methods expect the initial clusters to be in the -ci directory already and
don't do the sampling for you, but there is RandomSeedGenerator.buildRandom()
which you can use to produce these from your input data; a sketch follows.)
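
Here is the seeding path in code, for completeness. Again an untested
sketch, with a caveat: I am quoting buildRandom()'s shape from memory and
it has changed between releases (some versions also take a DistanceMeasure
argument), so check it against the version you are running.

import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;

public class SeedKMeans {
  public static void main(String[] args) throws Exception {
    // Sample k = 20 random vectors from the input as initial centers.
    Path input = new Path("testdata");    // your VectorWritable vectors
    Path output = new Path("clusters-0"); // placeholder seed directory
    // NOTE: signature quoted from memory; newer releases may also want a
    // DistanceMeasure here. Check your Mahout version.
    Path initialClusters = RandomSeedGenerator.buildRandom(input, output, 20);
    System.out.println("Wrote seed clusters to " + initialClusters);
    // Pass this directory to k-Means as the -ci argument mentioned above.
  }
}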

Let me know how this works for you,
Jeff


On 9/6/10 5:45 AM, Radek Maciaszek wrote:
> Hi,
>
> I am trying to use Mahout for my MSc project. I have successfully run all
> the clustering examples and am now trying to analyse some of my own data,
> unfortunately without much success.
>
> The input data I want to cluster is a list of vectors in tab-separated
> format:
> 1.2   0.0   0.0  3.414
> 0.0   0.4   0.0   0.3
> 16.2  0.0   0.0   0.0
> etc.
> I generated this file in Python and can easily change it to a
> comma-separated format or make any other necessary changes. It is a rather
> large file, with many thousands of dimensions and millions of rows; it
> contains TF/IDF scores calculated for users and the URLs they visited
> (each row is a user and each column a URL). Each row is a sparse vector.
>
> I would like to use k-means to cluster the users into 20+ clusters, but I
> am having problems running clustering on this data. To begin with, I
> simply put this file on Hadoop in place of the "testdata" file (originally
> synthetic_control.data) and ran "mahout
> org.apache.mahout.clustering.syntheticcontrol.canopy.Job". I was hoping to
> reuse the existing scripts, but that unfortunately gives me null pointer
> exceptions.
>
> What would be the fastest/best way of analysing this matrix in order to
> group the rows into clusters?
>
> Many thanks for your advice,
> Radek
>

