mahout-user mailing list archives

From Ankit Goel <ankitgoel2...@gmail.com>
Subject Re: how to interpret the result of the clustering by “mahout kmeans”
Date Thu, 23 Jul 2015 00:24:47 GMT
Hi Kumada,
I had the same problem until 2 days ago. Here are a few things I figured out
which I think would help. However, I'm working with Mahout 0.10.0, so I
might be very slightly off on what I'm saying.

Firstly, from your results the clusterdump format for 0.9 does seem to miss
the column id like you mentioned. Pat thinks, I believe, that there might be
a problem with the way the data was entered. The workaround is to access the
output through Java as opposed to the command line. I was quite confused with
some things (my dictionary had over 3400 terms), so using Java helped me get
clarity on a lot of things. Through Java code, you will be able to extract
the values of the columns properly.

Mahout is built on Hadoop, and it reads and writes its data in a Hadoop file
format called sequence files (SequenceFile), which store binary key/value
pairs. I don't know all of their benefits, except that they store data more
compactly than plain text. So any program you write deals with sequence
files. In fact you have already come across them when you converted your
data from a text file to the Mahout vector format. You probably used *mahout
seqdirectory* at the very start. You can explore sequence files with *mahout
seqdumper*. So what Pat is asking (correct me if I'm wrong) is whether, after
converting your raw data to the Mahout-readable format, you checked that the
vectors were right.
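
One way to do that check is to expand the sparse vectors printed by *mahout
seqdumper* back into full rows and compare them with the original data. Below
is a minimal sketch in Python (just for illustration, it is not Mahout code).
It assumes each dumped line looks like the ones in your mail, e.g.
`Key: 1: Value: {1:0.824,2:0.936,3:0.214}`, and that every vector has 5
dimensions. The function name and the regular expression are my own, not part
of any Mahout tool.

```python
import re

# Assumed line shape, copied from the seqdumper output quoted in this thread:
#   Key: 1: Value: {1:0.824,2:0.936,3:0.214}
LINE_RE = re.compile(r"^Key:\s*(\S+?):\s*Value:\s*\{(.*)\}\s*$")


def parse_seqdumper_line(line, cardinality=5):
    """Return (key, dense_vector) for one 'Key: ...: Value: {...}' line.

    Indices missing from the sparse form are filled with 0.0.
    cardinality is the assumed number of columns (5 in this thread).
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized line: %r" % line)
    key, body = match.group(1), match.group(2)
    dense = [0.0] * cardinality
    for entry in body.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty vector or a trailing comma
        index, value = entry.split(":")
        dense[int(index)] = float(value)
    return key, dense


if __name__ == "__main__":
    print(parse_seqdumper_line("Key: 1: Value: {1:0.824,2:0.936,3:0.214}"))
```

If the expanded rows match the rows under @DATA in the ARFF file, the
conversion step is fine and the missing column ids are only a display issue
on the clusterdump side.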

On Thu, Jul 23, 2015 at 4:50 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Clusterdump is a tool for examining the output. The sequencefiles *are*
> the output.
>
> Run “mahout kmeans” to get a list of the options and to see where the
> output is stored.
>
> On Jul 21, 2015, at 5:49 PM, 熊田 聖也 <seiya.kumada@cct-inc.co.jp> wrote:
>
> Thank you for your reply.
> I use Amazon Elastic MapReduce (EMR).
> It supports mahout-0.9/0.8, but not 0.7.
> In the case of mahout-0.9/0.8, the result obtained by “mahout
> clusterdump” does not contain the column id, but the result by 0.7 contains
> it.
>
> I have one question on your statement "Are the results in the sequence
> files correct? ."
> What do the sequence files mean?
> Which command of "mahout" yields them?
>
> Sincerely yours
> S.Kumada
>
>
> ________________________________________
> From: Pat Ferrel <pat@occamsmachete.com>
> Sent: July 19, 2015 3:00
> To: user@mahout.apache.org
> Subject: Re: how to interpret the result of the clustering by “mahout kmeans”
>
> This is probably a clusterdump formatting problem in Mahout 0.9. Have you
> tried Mahout 0.10.1, which is the latest version?
>
> Are the results in the sequence files correct? They are sparse vectors so
> must contain the column id.
>
>
> On Jul 14, 2015, at 1:20 AM, 熊田 聖也 <seiya.kumada@cct-inc.co.jp> wrote:
>
>
> Glad to meet you.
>
> This is my first question in the mahout mailing list.
>
>
> I’m now running a clustering calculation using “mahout kmeans.”
>
> My data is as follows:
>
>
> @RELATION rfm
>
> @ATTRIBUTE recency NUMERIC
>
> @ATTRIBUTE frequency NUMERIC
>
> @ATTRIBUTE money NUMERIC
>
> @ATTRIBUTE location NUMERIC
>
> @ATTRIBUTE position NUMERIC
>
> @DATA
>
> 0.472,0.275,0.099,0.952,0.047,
>
> 0.000,0.824,0.936,0.214,0.000,
>
> 0.000,0.537,0.656,0.591,0.000,
>
> ....
>
> 0.908,0.000,0.000,0.078,0.136,
>
> 0.134,0.000,0.000,0.781,0.160,
>
> 0.302,0.000,0.000,0.513,0.715,
>
> 0.472,0.000,0.000,0.749,0.047,
>
>
> The file is in the ARFF format.
>
> Each row is a 5-dimensional vector, and most of the rows contain zero
> values.
>
> I converted the ARFF file to the Vector format for use with "mahout
> kmeans."
>
> The resultant file is as follows:
>
>
> Key: 0: Value: {0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}
>
> Key: 1: Value: {1:0.824,2:0.936,3:0.214}
>
> Key: 2: Value: {1:0.537,2:0.656,3:0.591}
>
> Key: 3: Value: {1:0.954,2:0.253,3:0.721}
>
> Key: 4: Value: {1:0.187,2:0.735,3:0.782}
>
> Key: 5: Value: {1:0.517,2:0.276,3:0.096}
>
> Key: 6: Value: {1:0.189,2:0.127,3:0.517}
>
> ...
>
> Key: 993: Value: {0:0.662,3:0.218,4:0.69}
>
> Key: 994: Value: {0:0.56,3:0.682,4:0.153}
>
> Key: 995: Value: {0:0.788,3:0.929,4:0.967}
>
> Key: 996: Value: {0:0.908,3:0.078,4:0.136}
>
> Key: 997: Value: {0:0.134,3:0.781,4:0.16}
>
> Key: 998: Value: {0:0.302,3:0.513,4:0.715}
>
> Key: 999: Value: {0:0.472,3:0.749,4:0.047}
>
>
> In the above result, each vector is represented in a dictionary
> (index:value) format, e.g.
>
> {0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}.
>
>
> Using this file, I ran "mahout kmeans."
>
> (The current version of Mahout is 0.9.)
>
> After the calculation, I ran “mahout clusterdump”
>
> and got the result shown below:
>
>
> VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}
>
> VL-677{n=57 c=[0.445, 0.145, 0.839] r=[0.271, 0.099, 0.097]}
>
> VL-429{n=40 c=[0.117, 0.768, 0.674] r=[0.078, 0.156, 0.159]}
>
> VL-801{n=92 c=[0.318, 0.016, 0.007, 0.810, 0.191] r=[0.238, 0.060, 0.023,
> 0.137, 0.155]}
>
> VL-322{n=55 c=[0.605, 0.872, 0.380] r=[0.217, 0.083, 0.204]}
>
> VL-725{n=88 c=[0.351, 0.559, 0.760] r=[0.197, 0.206, 0.153]}
>
> VL-197{n=176 c=[0.500, 0.482, 0.774] r=[0.264, 0.260, 0.141]}
>
> VL-438{n=159 c=[0.618, 0.351, 0.288] r=[0.215, 0.203, 0.163]}
>
> VL-58{n=54 c=[0.157, 0.515, 0.211] r=[0.102, 0.229, 0.143]}
>
> VL-971{n=117 c=[0.339, 0.014, 0.007, 0.195, 0.282] r=[0.252, 0.052, 0.025,
> 0.133, 0.192]}
>
>
> On the other hand, when the same calculation is done with Mahout
> version 0.7, the result is as follows:
>
>
> VL-982{n=82 c=[0.124, 0.120, 0.108, 0.168, 0.150] r=[0.140, 0.177, 0.157,
> 0.115, 0.168]}
>
> VL-989{n=72 c=[0:0.687, 3:0.185, 4:0.463] r=[0:0.145, 3:0.122, 4:0.207]}
>
> VL-990{n=25 c=[0:0.808, 3:0.868, 4:0.320] r=[0:0.130, 3:0.103, 4:0.158]}
>
> VL-992{n=45 c=[0:0.276, 3:0.821, 4:0.753] r=[0:0.135, 3:0.104, 4:0.165]}
>
> VL-994{n=49 c=[0:0.630, 3:0.618, 4:0.336] r=[0:0.153, 3:0.130, 4:0.146]}
>
> VL-995{n=74 c=[0:0.782, 3:0.673, 4:0.771] r=[0:0.127, 3:0.179, 4:0.136]}
>
> VL-996{n=14 c=[0:0.842, 3:0.142, 4:0.147] r=[0:0.082, 3:0.140, 4:0.115]}
>
> VL-997{n=452 c=[1:0.494, 2:0.521, 3:0.528] r=[1:0.280, 2:0.277, 3:0.275]}
>
> VL-998{n=110 c=[0:0.354, 3:0.304, 4:0.764] r=[0:0.216, 3:0.178, 4:0.142]}
>
> VL-999{n=77 c=[0.232, 0.012, 0.008, 0.732, 0.157] r=[0.169, 0.040, 0.026,
> 0.170, 0.135]}
>
>
> In the result from version 0.7, the centroid coordinates are represented
> in the dictionary format, e.g.
>
> c=[0:0.687, 3:0.185, 4:0.463], which means [0.687, 0, 0, 0.185, 0.463].
>
> However, in the result from version 0.9, we cannot recover the centroid
> coordinates correctly,
>
> because we cannot know the positions of the zeros.
>
>
> Could you tell me how to interpret the result from version 0.9?
>
>
>


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
