mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: how to interpret the result of the clustering by “mahout kmeans”
Date Sat, 18 Jul 2015 18:00:09 GMT
This is probably a clusterdump formatting problem in Mahout 0.9. Have you tried Mahout 0.10.1,
which is the latest version?

Are the results in the sequence files correct? They are sparse vectors, so each entry must
carry its column ID.


On Jul 14, 2015, at 1:20 AM, 熊田 聖也 <seiya.kumada@cct-inc.co.jp> wrote:


Glad to meet you.

This is my first question in the mahout mailing list.


I am currently running a clustering job with “mahout kmeans.”

My data is as follows:


@RELATION rfm

@ATTRIBUTE recency NUMERIC

@ATTRIBUTE frequency NUMERIC

@ATTRIBUTE money NUMERIC

@ATTRIBUTE location NUMERIC

@ATTRIBUTE position NUMERIC

@DATA

0.472,0.275,0.099,0.952,0.047,

0.000,0.824,0.936,0.214,0.000,

0.000,0.537,0.656,0.591,0.000,

....

0.908,0.000,0.000,0.078,0.136,

0.134,0.000,0.000,0.781,0.160,

0.302,0.000,0.000,0.513,0.715,

0.472,0.000,0.000,0.749,0.047,


The file is in ARFF format.

Each row is a 5-dimensional vector, and most rows contain zero values.

I converted the ARFF file to the Vector format for use with "mahout kmeans."

The resultant file is as follows:


Key: 0: Value: {0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}

Key: 1: Value: {1:0.824,2:0.936,3:0.214}

Key: 2: Value: {1:0.537,2:0.656,3:0.591}

Key: 3: Value: {1:0.954,2:0.253,3:0.721}

Key: 4: Value: {1:0.187,2:0.735,3:0.782}

Key: 5: Value: {1:0.517,2:0.276,3:0.096}

Key: 6: Value: {1:0.189,2:0.127,3:0.517}

...

Key: 993: Value: {0:0.662,3:0.218,4:0.69}

Key: 994: Value: {0:0.56,3:0.682,4:0.153}

Key: 995: Value: {0:0.788,3:0.929,4:0.967}

Key: 996: Value: {0:0.908,3:0.078,4:0.136}

Key: 997: Value: {0:0.134,3:0.781,4:0.16}

Key: 998: Value: {0:0.302,3:0.513,4:0.715}

Key: 999: Value: {0:0.472,3:0.749,4:0.047}


In the above result, each vector is represented in a dictionary format (index:value pairs), e.g.

{0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}.
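
To make the mapping explicit, here is a minimal Python sketch of how a dense row from the ARFF data corresponds to this dictionary format. The helper name `to_sparse` is hypothetical and is not part of Mahout; it only illustrates that zero entries are dropped and the remaining values keep their column index.

```python
def to_sparse(row):
    """Return {column_index: value} for the nonzero entries of a dense row."""
    return {i: v for i, v in enumerate(row) if v != 0.0}


# Second data row of the ARFF file shown above.
dense = [0.000, 0.824, 0.936, 0.214, 0.000]
print(to_sparse(dense))  # {1: 0.824, 2: 0.936, 3: 0.214}
```
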


Using this file, I ran "mahout kmeans."

(I am currently using Mahout 0.9.)

After the calculation, I ran “mahout clusterdump”

and got the result shown below:


VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}

VL-677{n=57 c=[0.445, 0.145, 0.839] r=[0.271, 0.099, 0.097]}

VL-429{n=40 c=[0.117, 0.768, 0.674] r=[0.078, 0.156, 0.159]}

VL-801{n=92 c=[0.318, 0.016, 0.007, 0.810, 0.191] r=[0.238, 0.060, 0.023, 0.137, 0.155]}

VL-322{n=55 c=[0.605, 0.872, 0.380] r=[0.217, 0.083, 0.204]}

VL-725{n=88 c=[0.351, 0.559, 0.760] r=[0.197, 0.206, 0.153]}

VL-197{n=176 c=[0.500, 0.482, 0.774] r=[0.264, 0.260, 0.141]}

VL-438{n=159 c=[0.618, 0.351, 0.288] r=[0.215, 0.203, 0.163]}

VL-58{n=54 c=[0.157, 0.515, 0.211] r=[0.102, 0.229, 0.143]}

VL-971{n=117 c=[0.339, 0.014, 0.007, 0.195, 0.282] r=[0.252, 0.052, 0.025, 0.133, 0.192]}


On the other hand, when the same calculation is done with Mahout 0.7, the result
is as follows:


VL-982{n=82 c=[0.124, 0.120, 0.108, 0.168, 0.150] r=[0.140, 0.177, 0.157, 0.115, 0.168]}

VL-989{n=72 c=[0:0.687, 3:0.185, 4:0.463] r=[0:0.145, 3:0.122, 4:0.207]}

VL-990{n=25 c=[0:0.808, 3:0.868, 4:0.320] r=[0:0.130, 3:0.103, 4:0.158]}

VL-992{n=45 c=[0:0.276, 3:0.821, 4:0.753] r=[0:0.135, 3:0.104, 4:0.165]}

VL-994{n=49 c=[0:0.630, 3:0.618, 4:0.336] r=[0:0.153, 3:0.130, 4:0.146]}

VL-995{n=74 c=[0:0.782, 3:0.673, 4:0.771] r=[0:0.127, 3:0.179, 4:0.136]}

VL-996{n=14 c=[0:0.842, 3:0.142, 4:0.147] r=[0:0.082, 3:0.140, 4:0.115]}

VL-997{n=452 c=[1:0.494, 2:0.521, 3:0.528] r=[1:0.280, 2:0.277, 3:0.275]}

VL-998{n=110 c=[0:0.354, 3:0.304, 4:0.764] r=[0:0.216, 3:0.178, 4:0.142]}

VL-999{n=77 c=[0.232, 0.012, 0.008, 0.732, 0.157] r=[0.169, 0.040, 0.026, 0.170, 0.135]}


In the version 0.7 result, the centroid coordinates are given in the dictionary
format, e.g.

c=[0:0.687, 3:0.185, 4:0.463], which means [0.687, 0, 0, 0.185, 0.463].

However, in the version 0.9 result, the centroid coordinates cannot be read correctly,

because the positions of the zeros are not shown.
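
As an illustration of the interpretation described above, the following Python sketch expands a centroid printed in the 0.7 style into a dense vector. The function name `expand_sparse` is hypothetical (it is not a Mahout API), and the dimension of 5 is an assumption taken from the five ARFF attributes. The same expansion is not possible for a 0.9-style line such as c=[0.733, 0.608, 0.563], because the indices are missing.

```python
def expand_sparse(text, dim):
    """Expand a string like '[0:0.687, 3:0.185]' into a dense list of length dim."""
    dense = [0.0] * dim
    body = text.strip().strip("[]")
    for item in body.split(","):
        if not item.strip():
            continue  # tolerate an empty vector "[]"
        index, value = item.split(":")
        dense[int(index)] = float(value)
    return dense


# Centroid of VL-989 from the Mahout 0.7 output.
print(expand_sparse("[0:0.687, 3:0.185, 4:0.463]", 5))
# [0.687, 0.0, 0.0, 0.185, 0.463]
```
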


Could you tell me how to interpret the result from version 0.9?


