Sure, please. Please include hints about the various cases where it works (or does not). Thanks!

On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev wrote:
> Hello,
>
> In fact my estimation technique works only for the Meanshift algorithm,
> because in it the centers of the canopies move around. With pure Canopy
> clustering I don't think it will work.
>
> I spent some time realizing the idea of different kernel profiles with
> the mean shift clustering algorithm, and here are the results:
>
> 1. I changed the original algorithm a little. Previously, when one
> cluster "touched" other clusters, its centroid was computed only from the
> centroids of the clusters it touched, not from its own centroid. Now,
> before calculating the shift, I added one more line which makes the
> cluster observe its own centroid:
>
> public boolean shiftToMean(MeanShiftCanopy canopy) {
>   canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
>   canopy.computeConvergence(measure, convergenceDelta);
>   canopy.computeParameters();
>   return canopy.isConverged();
> }
>
> With this modification the convergence problem I described earlier also
> disappeared, although the number of iterations until convergence
> increased slightly. This change was necessary in order to introduce the
> other types of "soft" kernels.
>
> 2. I introduced an IKernelProfile interface with the methods:
>
> public double calculateValue(double distance, double h);
> public double calculateDerivativeValue(double distance, double h);
>
> 3. I created two implementations. TriangularKernelProfile, whose
> derivative value is:
>
> @Override
> public double calculateDerivativeValue(double distance, double h) {
>   return (distance < h) ? 1.0 : 0.0;
> }
>
> and NormalKernelProfile, whose derivative value is:
>
> @Override
> public double calculateDerivativeValue(double distance, double h) {
>   return UncommonDistributions.dNorm(distance, 0.0, h);
> }
>
> 4. I modified the core method for merging canopies:
>
> public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> canopies) {
>   MeanShiftCanopy closestCoveringCanopy = null;
>   double closestNorm = Double.MAX_VALUE;
>   for (MeanShiftCanopy canopy : canopies) {
>     double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
>     double weight = kernelProfile.calculateDerivativeValue(norm, t1);
>     if (weight > 0.0) {
>       aCanopy.touch(canopy, weight);
>     }
>     if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
>       closestNorm = norm;
>       closestCoveringCanopy = canopy;
>     }
>   }
>   if (closestCoveringCanopy == null) {
>     canopies.add(aCanopy);
>   } else {
>     closestCoveringCanopy.merge(aCanopy);
>   }
> }
>
> 5. I modified MeanShiftCanopy so that touching canopies observe each
> other with kernel-derivative weights:
>
> void touch(MeanShiftCanopy canopy, double weight) {
>   canopy.observe(getCenter(), weight * boundPoints.size());
>   observe(canopy.getCenter(), weight * canopy.boundPoints.size());
> }
>
> 6. I modified some other files as necessary for the code to compile and
> for the tests to pass. A standalone sketch of how steps 2-5 fit together
> follows below.
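> To make the moving parts concrete, here is a minimal standalone sketch
> (not the actual Mahout code; KernelProfile, ShiftSketch and shiftOnce are
> names invented for this illustration) of the kernel-weighted shift that
> steps 2-5 implement. Each center moves to the derivative-weighted mean of
> all centers within reach of the kernel; real canopies also weight by
> their bound-point counts, which is omitted here:
>
> interface KernelProfile {
>   double derivative(double distance, double h);
> }
>
> class ShiftSketch {
>   // One mean-shift step over 1-D centers: every center is replaced by
>   // the kernel-derivative-weighted mean of all centers (itself
>   // included, as in step 1, so the weight sum is never zero).
>   static double[] shiftOnce(double[] centers, KernelProfile k, double h) {
>     double[] shifted = new double[centers.length];
>     for (int i = 0; i < centers.length; i++) {
>       double sum = 0.0;
>       double weightSum = 0.0;
>       for (double c : centers) {
>         double w = k.derivative(Math.abs(centers[i] - c), h);
>         sum += w * c;
>         weightSum += w;
>       }
>       shifted[i] = sum / weightSum;
>     }
>     return shifted;
>   }
>
>   public static void main(String[] args) {
>     // Triangular profile: derivative is 1 inside the window, 0 outside.
>     KernelProfile triangular = new KernelProfile() {
>       public double derivative(double distance, double h) {
>         return (distance < h) ? 1.0 : 0.0;
>       }
>     };
>     double[] centers = {0.0, 0.4, 0.5, 3.0};
>     // The three nearby centers all shift to 0.3; the outlier stays at 3.0.
>     System.out.println(java.util.Arrays.toString(
>         shiftOnce(centers, triangular, 1.0)));
>   }
> }
>
> Iterating shiftOnce until no center moves by more than the convergence
> delta is the whole algorithm; the Normal profile just replaces the 0/1
> weights with Gaussian ones, which is what "blurs" the boundaries below.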
> With the algorithm changed in this way, I had the following observations:
>
> 1. Using NormalKernel "blurs" the cluster boundaries, i.e. the cluster
> content is more intermixed.
>
> 2. As there is no convergence problem any more, I found the following
> procedure for estimating T2 and the convergence delta:
> - Bind convergence delta = T2 / 2.
> - When you decrease T2, the number of iterations increases and the number
> of resulting clusters after convergence decreases.
> - You reach a point where decreasing T2 further still increases the
> number of iterations but leaves the number of resulting clusters
> unchanged. This is the best value for T2.
>
> 3. In the case of the Normal kernel, what you give as T1 is in fact the
> standard deviation of a normal distribution, so the radius of the window
> will be T1^2.
>
> If you are interested I can send the code.
>
> Regards, Vasil
>
> On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog wrote:
>
>> Vasil-
>>
>> Would you consider adding your estimation algorithm to this patch?
>> https://issues.apache.org/jira/browse/MAHOUT-563
>>
>> The estimator in there now is stupid; a real one would make the Canopy
>> algorithms orders of magnitude more useful.
>>
>> Lance
>>
>> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning wrote:
>>
>> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev wrote:
>> >
>> >> Dimension 1: Using linear regression with the gradient descent
>> >> algorithm, I find the trend of the line, i.e. whether it is
>> >> increasing, decreasing, or a straight line.
>> >> Dimension 2: Knowing the approximating line (from the linear
>> >> regression), I count how many times this line gets crossed by the
>> >> original signal. This helps in separating the cyclic data from all
>> >> the rest.
>> >> Dimension 3: The biggest increase/decrease of a single signal line.
>> >> This helps find shifts.
>> >>
>> >> So to say, I attach semantics to the data to be clustered. (I don't
>> >> know whether it is correct to do that, but I couldn't think of how an
>> >> algorithm could cope with the task without such additional
>> >> semantics.)
>> >
>> > It is very common for feature extraction like this to be the key to
>> > data-mining projects. Such features are absolutely critical for most
>> > time series mining and are highly application dependent.
>> >
>> > One key aspect of your features is that they are shift invariant.
>> >
>> >> Also I developed a small Swing application which visualizes the
>> >> clustered signals and which helped me in playing with the algorithms.
>> >
>> > Great idea.
>>
>> --
>> Lance Norskog
>> goksron@gmail.com

--
Lance Norskog
goksron@gmail.com
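As a footnote to the feature-extraction discussion quoted above, here is a
minimal sketch of the three time-series features Vasil describes: trend,
crossings of the fitted line, and biggest single-step jump. The names are
invented for this illustration, and the slope uses the closed-form
least-squares fit rather than the gradient descent Vasil mentions:

// Sketch: extract the three features described above from a signal
// sampled at x = 0, 1, ..., n-1.
class SignalFeatures {

  // Dimension 1: slope of the least-squares line y = a + b*x.
  // Positive = increasing, negative = decreasing, near zero = flat.
  static double trendSlope(double[] y) {
    int n = y.length;
    double meanX = (n - 1) / 2.0;
    double meanY = 0.0;
    for (double v : y) meanY += v;
    meanY /= n;
    double num = 0.0, den = 0.0;
    for (int x = 0; x < n; x++) {
      num += (x - meanX) * (y[x] - meanY);
      den += (x - meanX) * (x - meanX);
    }
    return num / den;
  }

  // Dimension 2: how many times the signal crosses the fitted line;
  // cyclic signals cross it many times.
  static int crossings(double[] y) {
    double b = trendSlope(y);
    double meanX = (y.length - 1) / 2.0;
    double meanY = 0.0;
    for (double v : y) meanY += v;
    meanY /= y.length;
    double a = meanY - b * meanX;
    int count = 0;
    double prev = y[0] - a; // residual at x = 0
    for (int x = 1; x < y.length; x++) {
      double r = y[x] - (a + b * x);
      if (r * prev < 0) count++; // sign change = one crossing
      if (r != 0) prev = r;
    }
    return count;
  }

  // Dimension 3: the largest single-step change; helps find shifts.
  static double biggestJump(double[] y) {
    double best = 0.0;
    for (int x = 1; x < y.length; x++) {
      best = Math.max(best, Math.abs(y[x] - y[x - 1]));
    }
    return best;
  }

  public static void main(String[] args) {
    double[] shifted = {1, 1, 1, 1, 5, 5, 5, 5};
    System.out.println(trendSlope(shifted));  // positive trend
    System.out.println(crossings(shifted));   // crossings of the fit
    System.out.println(biggestJump(shifted)); // 4.0, the shift
  }
}

Clustering the resulting three-dimensional vectors is what separates
trending, cyclic, and shifted signals: the additional semantics Vasil
attaches to the raw data.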