mahout-user mailing list archives

From Jeff Eastman
Subject Re: Transforming data for k-means analysis
Date Tue, 07 Sep 2010 17:30:05 GMT
  When you run kmeans from the command line with a -k value, the run() 
method calls the RandomSeedGenerator before calling the job() method to 
run the iterations. It's only when using the job() method directly from 
user code that you would perhaps want to use the RandomSeedGenerator (or 
Canopy) to populate the clusters in the -ci directory. So, yes, from the 
command line the driver already does it.
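
To make the seeding step concrete, here is a rough sketch of the general idea of picking k random input points as the initial cluster centers. It is only an illustration in plain Python and is not Mahout's RandomSeedGenerator code; the function name pick_seeds and its parameters are made up for this example, and reservoir sampling is simply one reasonable way to choose k points in a single pass.

```python
import random

def pick_seeds(points, k, rng=None):
    """Choose k points uniformly at random in one pass (reservoir sampling).

    Hypothetical helper for illustration; not part of Mahout.
    """
    rng = rng or random.Random()
    seeds = []
    for i, point in enumerate(points):
        if i < k:
            # Fill the reservoir with the first k points.
            seeds.append(point)
        else:
            # Replace an existing seed with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                seeds[j] = point
    return seeds

points = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
seeds = pick_seeds(points, 2, random.Random(42))
```

The k-means iterations would then start from these seeds, which is the role the clusters in the -ci directory play.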

I suggested looking at the InputDriver code, as that is what converts the 
space-delimited synthetic control text file into a Mahout sequence file 
of VectorWritable. Once you have data in that format, you should be good 
to go with any of the clustering implementations.
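
As a rough sketch of the parsing half of that conversion, the snippet below turns space-delimited text lines into lists of floats. It is plain Python for illustration, not the InputDriver code; the names parse_line and read_vectors are hypothetical, and the step of writing the vectors out as a sequence file of VectorWritable is deliberately not shown.

```python
def parse_line(line):
    """Split one space-delimited line into a list of floats."""
    return [float(token) for token in line.split()]

def read_vectors(lines):
    """Parse every non-blank line into a vector (a list of floats)."""
    vectors = []
    for line in lines:
        if line.strip():  # skip blank lines
            vectors.append(parse_line(line))
    return vectors

sample = ["1.0 2.5  3", "", "4 5 6\n"]
vectors = read_vectors(sample)
```

Each resulting list corresponds to one input row, which would then be wrapped in a vector and written to the sequence file.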

On 9/7/10 10:17 AM, rmx wrote:
> Hi Radek,
> If you do not want to use the script, you can run the kmeans driver directly
> from the command line.
> I think first you need to convert your dataset to a Mahout vector format.
> Then you need to convert it to sequence file format. Only after that can you
> run the driver over your sequence file.
> I have been trying to do this but I have never been successful. Tell me if you
> will...
> Jeff: when using the kmeans driver from the command line with a -k value, do
> you need to use RandomSeedGenerator.buildRandom()? I thought the driver already
> does it.
> Best,
> Rui
