From user-return-6106-apmail-mahout-user-archive=mahout.apache.org@mahout.apache.org Tue Jan 25 05:29:03 2011 Return-Path: Delivered-To: apmail-mahout-user-archive@www.apache.org Received: (qmail 72869 invoked from network); 25 Jan 2011 05:29:03 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 25 Jan 2011 05:29:03 -0000 Received: (qmail 18540 invoked by uid 500); 25 Jan 2011 05:29:03 -0000 Delivered-To: apmail-mahout-user-archive@mahout.apache.org Received: (qmail 18132 invoked by uid 500); 25 Jan 2011 05:28:59 -0000 Mailing-List: contact user-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@mahout.apache.org Delivered-To: mailing list user@mahout.apache.org Received: (qmail 18124 invoked by uid 99); 25 Jan 2011 05:28:58 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Jan 2011 05:28:58 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=10.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of goksron@gmail.com designates 209.85.161.42 as permitted sender) Received: from [209.85.161.42] (HELO mail-fx0-f42.google.com) (209.85.161.42) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Jan 2011 05:28:51 +0000 Received: by fxm11 with SMTP id 11so4793234fxm.1 for ; Mon, 24 Jan 2011 21:28:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type:content-transfer-encoding; bh=HLVD7RqnwFlP4k4vSjlVmodikivQNfFK/m0rbM1C4Kc=; b=j3jbEa89cuxoIe7Xz4GHs1t4sXVQ+MsK8s1QpbO//35XXfK3ZWEsxoLTJ00jzvRepe KmPbRlpB1zGQnJ8kdNps8rDBBon1QAKU7DLs/BWF69GBF492MtJpShJVBd7Blavbdxo6 Yl6VjZlHt2HOkLhzbHkuIJy3DRHzdah6ATQHU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=TMkflKXXpae7r+Zse9lBlSfaSL4U1UbrtEmqW3Uh0fpQ3XH1beGo4HAqFEg10kTkIA nZ31bJOjXmR1T88QzQF8noNjZ4M3zz9uHXuBmJPphfhtFKjuqaxc8OFqqEZAvDCF5EUZ P8sJ3wclpwyDKLyLWpbdaesPL8pKpiFqEofOQ= MIME-Version: 1.0 Received: by 10.223.96.78 with SMTP id g14mr1591189fan.15.1295933310338; Mon, 24 Jan 2011 21:28:30 -0800 (PST) Received: by 10.223.83.204 with HTTP; Mon, 24 Jan 2011 21:28:30 -0800 (PST) In-Reply-To: References: <4D370847.7030006@windwardsolutions.com> <4D387EBC.3090802@windwardsolutions.com> Date: Mon, 24 Jan 2011 21:28:30 -0800 Message-ID: Subject: Re: Meanshift clustering From: Lance Norskog To: user@mahout.apache.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sure, please. Please include hints about the various cases where it works (or does not). Thanks! On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev wrote: > Hello, > > In fact my estimation technique works only for the Meanshift algorithm, > because in it the centers of the canopies are moving around. With pure > Canopy clustering I don't think it will work. > > I spent some time trying to realize the idea of different kernel profiles > with the meanshift clustering algorithm and here are the results: > 1. I changed a little bit the original algorithm. Previously when one > cluster "touched" other clusters its centroid was computed only based on = the > centroids of the clusters it touched, but not based on his centroid itsel= f. > Now befor calculating the shift I added one more line which makes the > cluster observe his personal centroid: > > public boolean shiftToMean(MeanShiftCanopy canopy) { > =C2=A0 =C2=A0canopy.observe(canopy.getCenter(), canopy.getBoundPoints().s= ize()); > =C2=A0 =C2=A0canopy.computeConvergence(measure, convergenceDelta); > =C2=A0 =C2=A0canopy.computeParameters(); > =C2=A0 =C2=A0return canopy.isConverged(); > =C2=A0} > > With this modification also the problem with convergence of the algorithm > (that I described above) disappeared, although the number of iterations > until convergence increased slightly. > This change was necessary in order to introduce the other types of "soft" > kernels. > > 2. I introduced IKernelProfile interface which has the methods: > public double calculateValue(double distance, double h); > =C2=A0 =C2=A0public double calculateDerivativeValue(double distance, doub= le h); > > 3. I created 2 implementations: > TriangularKernelProfile with calculated value: > @Override > =C2=A0 =C2=A0public double calculateDerivativeValue(double distance, doub= le h) { > =C2=A0 =C2=A0 =C2=A0 =C2=A0return (distance < h) ? 1.0 : 0.0; > =C2=A0 =C2=A0} > > and NormalKernelProfile with calculated value: > @Override > =C2=A0 =C2=A0public double calculateDerivativeValue(double distance, doub= le h) { > =C2=A0 =C2=A0 =C2=A0 =C2=A0return UncommonDistributions.dNorm(distance, 0= .0, h); > =C2=A0 =C2=A0} > > 4. I modified the core for merging canopies: > > public void mergeCanopy(MeanShiftCanopy aCanopy, Collection > canopies) { > =C2=A0 =C2=A0MeanShiftCanopy closestCoveringCanopy =3D null; > =C2=A0 =C2=A0double closestNorm =3D Double.MAX_VALUE; > =C2=A0 =C2=A0for (MeanShiftCanopy canopy : canopies) { > =C2=A0 =C2=A0 =C2=A0double norm =3D measure.distance(canopy.getCenter(), > aCanopy.getCenter()); > =C2=A0 =C2=A0 =C2=A0double weight =3D kernelProfile.calculateDerivativeVa= lue(norm, t1); > =C2=A0 =C2=A0 =C2=A0if(weight > 0.0) > =C2=A0 =C2=A0 =C2=A0{ > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0aCanopy.touch(canopy, weight); > =C2=A0 =C2=A0 =C2=A0} > =C2=A0 =C2=A0 =C2=A0if (norm < t2 && ((closestCoveringCanopy =3D=3D null)= || (norm < > closestNorm))) { > =C2=A0 =C2=A0 =C2=A0 =C2=A0closestNorm =3D norm; > =C2=A0 =C2=A0 =C2=A0 =C2=A0closestCoveringCanopy =3D canopy; > =C2=A0 =C2=A0 =C2=A0} > =C2=A0 =C2=A0} > =C2=A0 =C2=A0if (closestCoveringCanopy =3D=3D null) { > =C2=A0 =C2=A0 =C2=A0canopies.add(aCanopy); > =C2=A0 =C2=A0} else { > =C2=A0 =C2=A0 =C2=A0closestCoveringCanopy.merge(aCanopy); > =C2=A0 =C2=A0} > =C2=A0} > > 5. I modified MeanShiftCanopy > void touch(MeanShiftCanopy canopy, double weight) { > =C2=A0 =C2=A0canopy.observe(getCenter(), weight*((double)boundPoints.size= ())); > =C2=A0 =C2=A0observe(canopy.getCenter(), weight*((double)canopy.boundPoin= ts.size())); > =C2=A0} > > 6. I modified some other files which were necessary to compile the code a= nd > for the tests to pass > > With the so changed algorithm I had the following observations: > > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster > content is more intermixed > 2. As there is no convergence problem any more I found the following > procedure for estimating T2 and convergence delta: > =C2=A0- bind convergence delta =3D T2 / 2 > =C2=A0- When you decrease T2 the number of iterations increases and the n= umber of > resulting clusters after convergence decreases > =C2=A0- You come to a moment when you decrease T2 the number of iteration= s > increases, but the number of resulting clusters remains unchanged. This i= s > the point with the best value for T2 > 3. In case of Normal kernel what you give as T1 is in fact the standard > deviation of a normal distribution, so the radius of the window will be T= 1^2 > > If you are interested I can send the code. > > Regards, Vasil > > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog wrote: > >> Vasil- >> >> Would you consider adding your estimation algorithm to this patch? >> https://issues.apache.org/jira/browse/MAHOUT-563 >> >> The estimator in there now is stupid- a real one would make the Canopy >> algorithms orders of magnitude more useful. >> >> Lance >> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning >> wrote: >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev >> wrote: >> > >> >> >> >> dimension 1: Using linear regression with gradient descent algorithm = I >> find >> >> what is the trend of the line, i.e. is it increasing, decreasing or >> >> straight >> >> line >> >> dimension 2: Knowing the approximating line (from the linear regressi= on) >> I >> >> count how many times this line gets crossed by the original signal. T= his >> >> helps in separating the cyclic data from all the rest >> >> dimension 3: What is the biggest increase/decrease of a single signal >> line. >> >> This helps find shifts >> >> >> >> So to say - I put a semantics for the data that are to be clustered (= I >> >> don't >> >> know if it is correct to do that, but I couldn't think of how an >> algorithm >> >> could cope with the task without such additional semantics) >> >> >> > >> > It is very common for feature extraction like this to be the key for >> > data-mining projects. =C2=A0 Such features are absolutely critical for= most >> time >> > series mining and are highly application dependent. >> > >> > One key aspect of your features is that they are shift invariant. >> > >> > >> >> Also I developed a small swing application which visualizes the >> clustered >> >> signals and which helped me in playing with the algorithms. >> >> >> > >> > Great idea. >> > >> >> >> >> -- >> Lance Norskog >> goksron@gmail.com >> > --=20 Lance Norskog goksron@gmail.com