From user-return-6106-apmail-mahout-user-archive=mahout.apache.org@mahout.apache.org Tue Jan 25 05:29:03 2011
Return-Path:
Delivered-To: apmail-mahout-user-archive@www.apache.org
Received: (qmail 72869 invoked from network); 25 Jan 2011 05:29:03 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
by minotaur.apache.org with SMTP; 25 Jan 2011 05:29:03 -0000
Received: (qmail 18540 invoked by uid 500); 25 Jan 2011 05:29:03 -0000
Delivered-To: apmail-mahout-user-archive@mahout.apache.org
Received: (qmail 18132 invoked by uid 500); 25 Jan 2011 05:28:59 -0000
Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@mahout.apache.org
Delivered-To: mailing list user@mahout.apache.org
Received: (qmail 18124 invoked by uid 99); 25 Jan 2011 05:28:58 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Jan 2011 05:28:58 +0000
X-ASF-Spam-Status: No, hits=-0.7 required=10.0
tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
X-Spam-Check-By: apache.org
Received-SPF: pass (athena.apache.org: domain of goksron@gmail.com designates 209.85.161.42 as permitted sender)
Received: from [209.85.161.42] (HELO mail-fx0-f42.google.com) (209.85.161.42)
by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Jan 2011 05:28:51 +0000
Received: by fxm11 with SMTP id 11so4793234fxm.1
for ; Mon, 24 Jan 2011 21:28:30 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=gamma;
h=domainkey-signature:mime-version:in-reply-to:references:date
:message-id:subject:from:to:content-type:content-transfer-encoding;
bh=HLVD7RqnwFlP4k4vSjlVmodikivQNfFK/m0rbM1C4Kc=;
b=j3jbEa89cuxoIe7Xz4GHs1t4sXVQ+MsK8s1QpbO//35XXfK3ZWEsxoLTJ00jzvRepe
KmPbRlpB1zGQnJ8kdNps8rDBBon1QAKU7DLs/BWF69GBF492MtJpShJVBd7Blavbdxo6
Yl6VjZlHt2HOkLhzbHkuIJy3DRHzdah6ATQHU=
DomainKey-Signature: a=rsa-sha1; c=nofws;
d=gmail.com; s=gamma;
h=mime-version:in-reply-to:references:date:message-id:subject:from:to
:content-type:content-transfer-encoding;
b=TMkflKXXpae7r+Zse9lBlSfaSL4U1UbrtEmqW3Uh0fpQ3XH1beGo4HAqFEg10kTkIA
nZ31bJOjXmR1T88QzQF8noNjZ4M3zz9uHXuBmJPphfhtFKjuqaxc8OFqqEZAvDCF5EUZ
P8sJ3wclpwyDKLyLWpbdaesPL8pKpiFqEofOQ=
MIME-Version: 1.0
Received: by 10.223.96.78 with SMTP id g14mr1591189fan.15.1295933310338; Mon,
24 Jan 2011 21:28:30 -0800 (PST)
Received: by 10.223.83.204 with HTTP; Mon, 24 Jan 2011 21:28:30 -0800 (PST)
In-Reply-To:
References:
<4D370847.7030006@windwardsolutions.com>
<4D387EBC.3090802@windwardsolutions.com>
Date: Mon, 24 Jan 2011 21:28:30 -0800
Message-ID:
Subject: Re: Meanshift clustering
From: Lance Norskog
To: user@mahout.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Sure, please. Please include hints about the various cases where it
works (or does not).
Thanks!
On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev wrote:
> Hello,
>
> In fact my estimation technique works only for the Meanshift algorithm,
> because in it the centers of the canopies are moving around. With pure
> Canopy clustering I don't think it will work.
>
> I spent some time trying to realize the idea of different kernel profiles
> with the meanshift clustering algorithm and here are the results:
> 1. I changed a little bit the original algorithm. Previously when one
> cluster "touched" other clusters its centroid was computed only based on =
the
> centroids of the clusters it touched, but not based on his centroid itsel=
f.
> Now befor calculating the shift I added one more line which makes the
> cluster observe his personal centroid:
>
> public boolean shiftToMean(MeanShiftCanopy canopy) {
> =C2=A0 =C2=A0canopy.observe(canopy.getCenter(), canopy.getBoundPoints().s=
ize());
> =C2=A0 =C2=A0canopy.computeConvergence(measure, convergenceDelta);
> =C2=A0 =C2=A0canopy.computeParameters();
> =C2=A0 =C2=A0return canopy.isConverged();
> =C2=A0}
>
> With this modification also the problem with convergence of the algorithm
> (that I described above) disappeared, although the number of iterations
> until convergence increased slightly.
> This change was necessary in order to introduce the other types of "soft"
> kernels.
>
> 2. I introduced IKernelProfile interface which has the methods:
> public double calculateValue(double distance, double h);
> =C2=A0 =C2=A0public double calculateDerivativeValue(double distance, doub=
le h);
>
> 3. I created 2 implementations:
> TriangularKernelProfile with calculated value:
> @Override
> =C2=A0 =C2=A0public double calculateDerivativeValue(double distance, doub=
le h) {
> =C2=A0 =C2=A0 =C2=A0 =C2=A0return (distance < h) ? 1.0 : 0.0;
> =C2=A0 =C2=A0}
>
> and NormalKernelProfile with calculated value:
> @Override
> =C2=A0 =C2=A0public double calculateDerivativeValue(double distance, doub=
le h) {
> =C2=A0 =C2=A0 =C2=A0 =C2=A0return UncommonDistributions.dNorm(distance, 0=
.0, h);
> =C2=A0 =C2=A0}
>
> 4. I modified the core for merging canopies:
>
> public void mergeCanopy(MeanShiftCanopy aCanopy, Collection
> canopies) {
> =C2=A0 =C2=A0MeanShiftCanopy closestCoveringCanopy =3D null;
> =C2=A0 =C2=A0double closestNorm =3D Double.MAX_VALUE;
> =C2=A0 =C2=A0for (MeanShiftCanopy canopy : canopies) {
> =C2=A0 =C2=A0 =C2=A0double norm =3D measure.distance(canopy.getCenter(),
> aCanopy.getCenter());
> =C2=A0 =C2=A0 =C2=A0double weight =3D kernelProfile.calculateDerivativeVa=
lue(norm, t1);
> =C2=A0 =C2=A0 =C2=A0if(weight > 0.0)
> =C2=A0 =C2=A0 =C2=A0{
> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0aCanopy.touch(canopy, weight);
> =C2=A0 =C2=A0 =C2=A0}
> =C2=A0 =C2=A0 =C2=A0if (norm < t2 && ((closestCoveringCanopy =3D=3D null)=
|| (norm <
> closestNorm))) {
> =C2=A0 =C2=A0 =C2=A0 =C2=A0closestNorm =3D norm;
> =C2=A0 =C2=A0 =C2=A0 =C2=A0closestCoveringCanopy =3D canopy;
> =C2=A0 =C2=A0 =C2=A0}
> =C2=A0 =C2=A0}
> =C2=A0 =C2=A0if (closestCoveringCanopy =3D=3D null) {
> =C2=A0 =C2=A0 =C2=A0canopies.add(aCanopy);
> =C2=A0 =C2=A0} else {
> =C2=A0 =C2=A0 =C2=A0closestCoveringCanopy.merge(aCanopy);
> =C2=A0 =C2=A0}
> =C2=A0}
>
> 5. I modified MeanShiftCanopy
> void touch(MeanShiftCanopy canopy, double weight) {
> =C2=A0 =C2=A0canopy.observe(getCenter(), weight*((double)boundPoints.size=
()));
> =C2=A0 =C2=A0observe(canopy.getCenter(), weight*((double)canopy.boundPoin=
ts.size()));
> =C2=A0}
>
> 6. I modified some other files which were necessary to compile the code a=
nd
> for the tests to pass
>
> With the so changed algorithm I had the following observations:
>
> 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster
> content is more intermixed
> 2. As there is no convergence problem any more I found the following
> procedure for estimating T2 and convergence delta:
> =C2=A0- bind convergence delta =3D T2 / 2
> =C2=A0- When you decrease T2 the number of iterations increases and the n=
umber of
> resulting clusters after convergence decreases
> =C2=A0- You come to a moment when you decrease T2 the number of iteration=
s
> increases, but the number of resulting clusters remains unchanged. This i=
s
> the point with the best value for T2
> 3. In case of Normal kernel what you give as T1 is in fact the standard
> deviation of a normal distribution, so the radius of the window will be T=
1^2
>
> If you are interested I can send the code.
>
> Regards, Vasil
>
> On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog wrote:
>
>> Vasil-
>>
>> Would you consider adding your estimation algorithm to this patch?
>> https://issues.apache.org/jira/browse/MAHOUT-563
>>
>> The estimator in there now is stupid- a real one would make the Canopy
>> algorithms orders of magnitude more useful.
>>
>> Lance
>>
>> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning
>> wrote:
>> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev
>> wrote:
>> >
>> >>
>> >> dimension 1: Using linear regression with gradient descent algorithm =
I
>> find
>> >> what is the trend of the line, i.e. is it increasing, decreasing or
>> >> straight
>> >> line
>> >> dimension 2: Knowing the approximating line (from the linear regressi=
on)
>> I
>> >> count how many times this line gets crossed by the original signal. T=
his
>> >> helps in separating the cyclic data from all the rest
>> >> dimension 3: What is the biggest increase/decrease of a single signal
>> line.
>> >> This helps find shifts
>> >>
>> >> So to say - I put a semantics for the data that are to be clustered (=
I
>> >> don't
>> >> know if it is correct to do that, but I couldn't think of how an
>> algorithm
>> >> could cope with the task without such additional semantics)
>> >>
>> >
>> > It is very common for feature extraction like this to be the key for
>> > data-mining projects. =C2=A0 Such features are absolutely critical for=
most
>> time
>> > series mining and are highly application dependent.
>> >
>> > One key aspect of your features is that they are shift invariant.
>> >
>> >
>> >> Also I developed a small swing application which visualizes the
>> clustered
>> >> signals and which helped me in playing with the algorithms.
>> >>
>> >
>> > Great idea.
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
--=20
Lance Norskog
goksron@gmail.com