mahout-user mailing list archives

From "Yuji NISHIDA@U-Tokyo" <nishidyatuto...@gmail.com>
Subject Re: mahout clusterdump output
Date Mon, 10 Sep 2012 16:46:11 GMT
Thank you for your kind explanation.
I added the -cl option when running kmeans, and it no longer seems to be a problem.

One more thing I would like to confirm: my clusterDump result shows "VL-", not "CL-".
Do you think this is the correct output?

Best regards.

2012/9/11 Jeff Eastman <jdog@windwardsolutions.com>:
> I think the discrepancy between the number (n=) of vectors reported by the
> cluster and the number of points actually clustered by the -cl option is
> normal.
>
> In the final iteration, points are assigned to (observed by) (classified as)
> each cluster based upon the distance measure and the cluster center computed
> from the previous iteration. The (n=) value records the number of points
> "observed by" the cluster in that iteration.
> After the final iteration, a new cluster center is calculated for each
> cluster. This moves the center by some amount, less than the convergence
> threshold, but it moves.
> During the subsequent classification (-cl) step, these new centers are used
> to classify the points for output. This will inevitably cause some points to
> be assigned to (observed by) (classified as) a different cluster and so the
> output clusteredPoints will reflect this final assignment.
>
> In small, contrived examples, the clustering will likely be more stable
> between the final iteration and the output of clustered points.
>
>
>
> On 9/10/12 9:06 AM, Whitmore, Mattie wrote:
>
> Hi,
>
> I too am having this problem.  I have a very low-dimensional space (3) and
> a lot of vectors (hundreds of millions), so I can't print them all to
> disk (I receive an OOM error).  However, I can print 30 sample points
> easily, and doing so showed results similar to yours (I "named" my vectors to
> be the number of vectors clusterDumper printed in the cluster):
>
> VL-50{n=0 c=[...] r=[]}
>         Weight : [props - optional]:  Point:
>         1.0:    1 = [...]
>         1.0:    2 = [...]
> 		...
>         1.0:   10 = [...]
>
> --> note also radius is blank, whereas the points do have spread in all
> dimensions, this happened ONLY with converged clusters.
>
> CL-51{n=4 c=[...] r=[...]}
>         Weight : [props - optional]:  Point:
>         1.0:    1 = [...]
>         1.0:    2 = [...]
> 		...
>         1.0:    6 = [...]
>
> As far as I understand the algorithm, problems which arise due to
> dimensionality are convergence problems.  Basically, distance between points
> is "longer" as dimension increases (volume increases dramatically as
> dimension increases).
>
> This shouldn't affect clusterDumper, as clusterDumper simply reports on
> sequence files from a completed job.  This is why the discrepancy is not
> making a lot of sense to me.  Having more vectors within each cluster makes
> sense -- when I sum the printed n values, I get a number orders of magnitude
> smaller than the number of vectors I clustered.
>
> I used Mahout v0.7, Hadoop 0.20.2-cdh3u3
>
>
> -----Original Message-----
> From: Yuji NISHIDA@U-Tokyo [mailto:nishidyatutokyo@gmail.com]
> Sent: Sunday, September 09, 2012 4:46 AM
> To: user@mahout.apache.org
> Subject: Re: mahout clusterdump output
>
> Hi all
>
> I would still like to confirm that this is not a problem.
> I just hope the n value in particular is not problematic...
>
> I discussed this in my lab, and one of our members noted that the dimension of
> the feature vectors and the number of vectors I used were very different.
> I used 600,000 vectors of 100 dimensions each.
>
> Do you think that using a small dimension together with a large number of
> vectors may cause problems, and that we need to keep some relation
> between the two (especially in number)?
> Or do you think 100 is too small for the dimension?
>
> I would appreciate it very much if someone could follow up on my question.
>
> Regards.
>
> 2012/8/4 Yuji NISHIDA@U-Tokyo <nishidyatutokyo@gmail.com>:
>
> Dear all
>
> I am using mahout to run canopy and kmeans, and I have a problem
> with the clusterdump output.
> Each vector is named with a simple number incremented from 1.
>
> When I used 5,000 vectors, I got correct output. It looks like:
>
> VL-0{n=64,c=[...], r[...]}
>     1.0: 1= [...]
>     1.0: 3= [...]
>     1.0: 4= [...]
>      ...
>     1.0: 396= [...]    # The number of vectors is exactly the same as n (64).
> VL-1{n=5,c=[...], r[...]}
>     1.0: 2= [...]
>     1.0: 12= [...]
>     ...
>     1.0: 4221= [...]
> VL-2{n=121,c=[...], r[...]}
> ...
>
> The number of vectors in each VL is exactly the same as its n value.
>
> When I used 600,000 vectors, the output looked wrong:
>
> VL-0{n=14,c=[...], r[...]}
>     1.0: 66636= [...]
>     1.0: 122570= [...]
>     ...
>     1.0: 522794= [...]    # The number of vectors is 31.
> VL-8{n=0,c=[...], r[...]}
>     1.0: 393539= [...]
>     1.0: 398877= [...]
>     ...
>     1.0: 513448= [...]    # The number of vectors is 5.
> VL-16{n=2,c=[...], r[...]}
> ...
>
> It looks as if VL-1 to VL-7 and VL-9 to VL-15 are not used, but I confirmed
> that they exist in the output.
> The VLs seem to appear in the order 0,8,16,...,11552, 1,9,17,...,11553,
> 2,10,18... and so on.
>
> Can I trust this result, or should I suspect that it is caused by a bug?
>
> Hadoop : 0.20.204
> Mahout : rev. 1351561, 1366995, 1367871
>
> Best regards.
>
> --
> nishidy@u-tokyo
>
>
>



-- 
nishidy@u-tokyo
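
Jeff's explanation above, of why the n= value reported by a cluster can differ from the number of points listed after the -cl step, can be reproduced with a toy example. The following is a minimal Python sketch and not Mahout code: the 1-D data, the function names (nearest, kmeans, classify) and the convergence delta of 0.5 are all assumptions made up for illustration.

```python
def nearest(point, centers):
    """Index of the center closest to point (1-D, absolute distance)."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def kmeans(points, centers, delta, max_iter=10):
    """Toy 1-D k-means (hypothetical helper, not Mahout's implementation).

    Returns (final_centers, n), where n[i] is the number of points that
    cluster i observed in the last iteration, i.e. counted against the
    centers computed in the previous iteration.
    """
    n = [0] * len(centers)
    for _ in range(max_iter):
        members = [[] for _ in centers]
        for p in points:
            members[nearest(p, centers)].append(p)
        n = [len(m) for m in members]
        # Recompute each center; an empty cluster keeps its old center.
        new_centers = [sum(m) / len(m) if m else c
                       for m, c in zip(members, centers)]
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < delta:  # converged: every center moved less than delta
            break
    return centers, n


def classify(points, centers):
    """Count points per cluster using the final centers (the -cl analogue)."""
    counts = [0] * len(centers)
    for p in points:
        counts[nearest(p, centers)] += 1
    return counts


# Made-up data: 4.99 sits just left of the initial boundary at 5.0.
points = [-2.0, -1.0, -1.99, 4.99, 9.0, 10.0, 10.4]
centers, n = kmeans(points, [0.0, 10.0], delta=0.5)
print("n reported by clusters:      ", n)
print("points after classification: ", classify(points, centers))
```

With this made-up data the clusters report n = [4, 3], while classification against the final centers yields [3, 4]: the second center moves by about 0.2, which is below the delta, yet that is enough to pull the point 4.99 across the boundary.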
