Sure, please. Please include hints about the various cases where it
works (or does not).
Thanks!
On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev wrote:
> Hello,
>
> In fact my estimation technique works only for the Meanshift algorithm,
> because in it the centers of the canopies are moving around. With pure
> Canopy clustering I don't think it will work.
>
> I spent some time trying to realize the idea of different kernel profiles
> with the meanshift clustering algorithm and here are the results:
> 1. I changed a little bit the original algorithm. Previously when one
> cluster "touched" other clusters its centroid was computed only based on the
> centroids of the clusters it touched, but not based on his centroid itself.
> Now befor calculating the shift I added one more line which makes the
> cluster observe his personal centroid:
>
> public boolean shiftToMean(MeanShiftCanopy canopy) {
> canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> canopy.computeConvergence(measure, convergenceDelta);
> canopy.computeParameters();
> return canopy.isConverged();
> }
>
> With this modification also the problem with convergence of the algorithm
> (that I described above) disappeared, although the number of iterations
> until convergence increased slightly.
> This change was necessary in order to introduce the other types of "soft"
> kernels.
>
> 2. I introduced IKernelProfile interface which has the methods:
> public double calculateValue(double distance, double h);
> public double calculateDerivativeValue(double distance, double h);
>
> 3. I created 2 implementations:
> TriangularKernelProfile with calculated value:
> @Override
> public double calculateDerivativeValue(double distance, double h) {
> return (distance < h) ? 1.0 : 0.0;
> }
>
> and NormalKernelProfile with calculated value:
> @Override
> public double calculateDerivativeValue(double distance, double h) {
> return UncommonDistributions.dNorm(distance, 0.0, h);
> }
>
> 4. I modified the core for merging canopies:
>
> public void mergeCanopy(MeanShiftCanopy aCanopy, Collection
> canopies) {
> MeanShiftCanopy closestCoveringCanopy = null;
> double closestNorm = Double.MAX_VALUE;
> for (MeanShiftCanopy canopy : canopies) {
> double norm = measure.distance(canopy.getCenter(),
> aCanopy.getCenter());
> double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> if(weight > 0.0)
> {
> aCanopy.touch(canopy, weight);
> }
> if (norm < t2 && ((closestCoveringCanopy == null) || (norm <
> closestNorm))) {
> closestNorm = norm;
> closestCoveringCanopy = canopy;
> }
> }
> if (closestCoveringCanopy == null) {
> canopies.add(aCanopy);
> } else {
> closestCoveringCanopy.merge(aCanopy);
> }
> }
>
> 5. I modified MeanShiftCanopy
> void touch(MeanShiftCanopy canopy, double weight) {
> canopy.observe(getCenter(), weight*((double)boundPoints.size()));
> observe(canopy.getCenter(), weight*((double)canopy.boundPoints.size()));
> }
>
> 6. I modified some other files which were necessary to compile the code and
> for the tests to pass
>
> With the so changed algorithm I had the following observations:
>
> 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster
> content is more intermixed
> 2. As there is no convergence problem any more I found the following
> procedure for estimating T2 and convergence delta:
> - bind convergence delta = T2 / 2
> - When you decrease T2 the number of iterations increases and the number of
> resulting clusters after convergence decreases
> - You come to a moment when you decrease T2 the number of iterations
> increases, but the number of resulting clusters remains unchanged. This is
> the point with the best value for T2
> 3. In case of Normal kernel what you give as T1 is in fact the standard
> deviation of a normal distribution, so the radius of the window will be T1^2
>
> If you are interested I can send the code.
>
> Regards, Vasil
>
> On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog wrote:
>
>> Vasil-
>>
>> Would you consider adding your estimation algorithm to this patch?
>> https://issues.apache.org/jira/browse/MAHOUT-563
>>
>> The estimator in there now is stupid- a real one would make the Canopy
>> algorithms orders of magnitude more useful.
>>
>> Lance
>>
>> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning
>> wrote:
>> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev
>> wrote:
>> >
>> >>
>> >> dimension 1: Using linear regression with gradient descent algorithm I
>> find
>> >> what is the trend of the line, i.e. is it increasing, decreasing or
>> >> straight
>> >> line
>> >> dimension 2: Knowing the approximating line (from the linear regression)
>> I
>> >> count how many times this line gets crossed by the original signal. This
>> >> helps in separating the cyclic data from all the rest
>> >> dimension 3: What is the biggest increase/decrease of a single signal
>> line.
>> >> This helps find shifts
>> >>
>> >> So to say - I put a semantics for the data that are to be clustered (I
>> >> don't
>> >> know if it is correct to do that, but I couldn't think of how an
>> algorithm
>> >> could cope with the task without such additional semantics)
>> >>
>> >
>> > It is very common for feature extraction like this to be the key for
>> > data-mining projects. Such features are absolutely critical for most
>> time
>> > series mining and are highly application dependent.
>> >
>> > One key aspect of your features is that they are shift invariant.
>> >
>> >
>> >> Also I developed a small swing application which visualizes the
>> clustered
>> >> signals and which helped me in playing with the algorithms.
>> >>
>> >
>> > Great idea.
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
--
Lance Norskog
goksron@gmail.com