mahout-user mailing list archives

From Pat Ferrel
Subject Re: K-Means on Hadoop Cluster
Date Sun, 25 May 2014 16:39:06 GMT
You need to create a Mahout distributed row matrix, which is one or more SequenceFiles of
key/value pairs of the form:
<IntWritable>: <VectorWritable>

The vector holds all your values; the IntWritable holds the Mahout ID (key) for the vector.
It is a positive ordinal. Usually it corresponds to some ID you already have for the vector,
so you create a Mahout int for each new vector and put it in a dictionary that maps your
ID to and from the Mahout ID. After clustering, you map the Mahout ID back to yours.
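
As a rough illustration of the dictionary idea above, here is a minimal sketch in plain Java. The
class name IdDictionary and its methods are hypothetical (they are not part of Mahout), and it
assumes your own IDs are strings and that keys are handed out in order starting at 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Mahout class): maps your own IDs to ordinal int keys and back. */
final class IdDictionary {
    private final Map<String, Integer> toMahout = new HashMap<String, Integer>();
    private final List<String> toExternal = new ArrayList<String>();

    /** Returns the key already assigned to externalId, or assigns the next ordinal. */
    int mahoutIdFor(String externalId) {
        Integer existing = toMahout.get(externalId);
        if (existing != null) {
            return existing;
        }
        int next = toExternal.size(); // assumption: keys start at 0 and grow by 1
        toMahout.put(externalId, next);
        toExternal.add(externalId);
        return next;
    }

    /** Maps a key back to your original ID, e.g. when reading the clustering output. */
    String externalIdFor(int mahoutId) {
        return toExternal.get(mahoutId);
    }

    int size() {
        return toExternal.size();
    }
}
```

The int returned by mahoutIdFor is what you would wrap in the IntWritable key; after clustering,
externalIdFor turns the key from the output back into your own ID. With 5 million vectors the
dictionary should still fit in memory, but you would probably want to persist it alongside the data.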

The VectorWritable is created from a Vector. For the data as you have described it, you would use
the DenseVector implementation. If you have a lot of 0s, you may want to give your columns Mahout
IDs too and use sparse vectors to create a sparse matrix; all missing values are assumed to be 0.
This may improve performance. There is also an implementation of Vector called NamedVector, which
wraps another Vector and lets you put your ID in it as a string, so you can follow the vector
through the calculations.
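
To make the sparse idea concrete, here is a small sketch in plain Java that keeps only the non-zero
columns of one input row. SparseRow is a hypothetical name, and the code deliberately avoids the
Mahout classes; it assumes each row is a line of whitespace-separated numbers and that the column
index (starting at 0) serves as the column ID.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not a Mahout class): sparse view of one whitespace-separated row. */
final class SparseRow {
    /** Parses a row and stores only the non-zero entries, keyed by column index. */
    static Map<Integer, Double> parse(String line) {
        Map<Integer, Double> nonZero = new TreeMap<Integer, Double>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return nonZero; // an empty line has no entries
        }
        String[] tokens = trimmed.split("\\s+");
        for (int col = 0; col < tokens.length; col++) {
            double value = Double.parseDouble(tokens[col]);
            if (value != 0.0) {
                nonZero.put(col, value);
            }
        }
        return nonZero;
    }

    /** A column that was never stored reads back as 0. */
    static double get(Map<Integer, Double> row, int col) {
        Double value = row.get(col);
        return value == null ? 0.0 : value;
    }
}
```

For a row that is mostly zeros, only a handful of (column, value) pairs are stored, which is the
saving a sparse vector gives you. Whether it pays off with only 38 columns depends on how many of
them are really 0.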

On May 24, 2014, at 11:35 AM, Adri Gómez wrote:


First, sorry for my English.

I'm a noob in Mahout and Hadoop. I want to run k-means clustering on
Hadoop in pseudo-distributed mode. I have 5 million vectors in a .mat file,
with 38 numeric features for each vector, like this: 0 0 1 0 0 0 0 0 0 0 0
0 ...

I've run the examples that I've found, like Reuters or
synthetic data. I know I have to convert these vectors to a SequenceFile, but
I don't know if I have to do anything else first.

I'm using Mahout 0.7 and Hadoop 1.2.1.


*Gómez Muñoz, Adrián.*
