mahout-user mailing list archives

From Vasil Vasilev <vavasi...@gmail.com>
Subject Re: Meanshift clustering
Date Tue, 25 Jan 2011 15:33:00 GMT
It is added to https://issues.apache.org/jira/browse/MAHOUT-15

I hope this was the correct place to add it!

On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <goksron@gmail.com> wrote:

> Sure, please do. Please include hints about the various cases where it
> works (or does not).
>
> Thanks!
>
> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <vavasilev@gmail.com> wrote:
> > Hello,
> >
> > In fact my estimation technique works only for the Meanshift algorithm,
> > because in it the centers of the canopies move around. With pure
> > Canopy clustering I don't think it will work.
> >
> > I spent some time realizing the idea of different kernel profiles
> > with the meanshift clustering algorithm, and here are the results:
> > 1. I changed the original algorithm a little bit. Previously, when one
> > cluster "touched" other clusters, its centroid was computed only from the
> > centroids of the clusters it touched, but not from its own centroid.
> > Now, before calculating the shift, I added one more line which makes the
> > cluster observe its own centroid:
> >
> > public boolean shiftToMean(MeanShiftCanopy canopy) {
> >    // New: the canopy also observes its own centroid before shifting
> >    canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> >    canopy.computeConvergence(measure, convergenceDelta);
> >    canopy.computeParameters();
> >    return canopy.isConverged();
> >  }
> >
> > With this modification the convergence problem of the algorithm
> > (that I described earlier) also disappeared, although the number of
> > iterations until convergence increased slightly.
> > This change was necessary in order to introduce the other types of "soft"
> > kernels.
> >
> > 2. I introduced an IKernelProfile interface which has the methods:
> >
> > public interface IKernelProfile {
> >    public double calculateValue(double distance, double h);
> >    public double calculateDerivativeValue(double distance, double h);
> > }
> >
> > 3. I created 2 implementations:
> > TriangularKernelProfile, whose derivative value is:
> >
> > @Override
> >    public double calculateDerivativeValue(double distance, double h) {
> >        return (distance < h) ? 1.0 : 0.0;
> >    }
> >
> > and NormalKernelProfile, whose derivative value is:
> >
> > @Override
> >    public double calculateDerivativeValue(double distance, double h) {
> >        return UncommonDistributions.dNorm(distance, 0.0, h);
> >    }
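> >
> > For illustration, a complete profile class might then look like this (a
> > sketch only: the calculateValue body is my guess at a triangular kernel
> > value; the merge code below uses only calculateDerivativeValue):
> >
> > public class TriangularKernelProfile implements IKernelProfile {
> >
> >    @Override
> >    public double calculateValue(double distance, double h) {
> >        // Assumed: linearly decaying kernel value inside the window
> >        return (distance < h) ? 1.0 - distance / h : 0.0;
> >    }
> >
> >    @Override
> >    public double calculateDerivativeValue(double distance, double h) {
> >        // Flat profile inside the window, as in the snippet above
> >        return (distance < h) ? 1.0 : 0.0;
> >    }
> > }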
> >
> > 4. I modified the code for merging canopies:
> >
> > public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> canopies) {
> >    MeanShiftCanopy closestCoveringCanopy = null;
> >    double closestNorm = Double.MAX_VALUE;
> >    for (MeanShiftCanopy canopy : canopies) {
> >      double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
> >      // Kernel-weighted "touch" replaces the former hard T1 test
> >      double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> >      if (weight > 0.0) {
> >        aCanopy.touch(canopy, weight);
> >      }
> >      if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
> >        closestNorm = norm;
> >        closestCoveringCanopy = canopy;
> >      }
> >    }
> >    if (closestCoveringCanopy == null) {
> >      canopies.add(aCanopy);
> >    } else {
> >      closestCoveringCanopy.merge(aCanopy);
> >    }
> >  }
> >
> > 5. I modified MeanShiftCanopy:
> >
> > void touch(MeanShiftCanopy canopy, double weight) {
> >    // Each canopy observes the other's center, scaled by the kernel weight
> >    canopy.observe(getCenter(), weight * boundPoints.size());
> >    observe(canopy.getCenter(), weight * canopy.boundPoints.size());
> >  }
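> >
> > For illustration, with the triangular profile this weighted touch reduces
> > to the original hard merge (weight 1.0 inside T1, 0.0 outside), while the
> > normal profile lets more distant canopies contribute with smaller weights:
> >
> > IKernelProfile profile = new NormalKernelProfile();
> > double near = profile.calculateDerivativeValue(0.0, 1.0); // ~0.399, maximal weight
> > double far  = profile.calculateDerivativeValue(2.0, 1.0); // ~0.054, two deviations out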
> >
> > 6. I modified some other files, as necessary for the code to compile and
> > for the tests to pass.
> >
> > With the modified algorithm I made the following observations:
> >
> > 1. Using NormalKernel "blurs" the cluster boundaries, i.e. the cluster
> > content is more intermixed.
> > 2. As there is no convergence problem any more, I found the following
> > procedure for estimating T2 and the convergence delta (see the sketch
> > after this list):
> >  - bind convergence delta = T2 / 2
> >  - as you decrease T2, the number of iterations increases and the number
> > of resulting clusters after convergence decreases
> >  - eventually you reach a point where decreasing T2 further still
> > increases the number of iterations but leaves the number of resulting
> > clusters unchanged. This is the best value for T2
> > 3. In the case of the normal kernel, what you give as T1 is in fact the
> > standard deviation of a normal distribution, so the radius of the window
> > will be T1^2
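> >
> > For illustration, the procedure in point 2 could be automated roughly like
> > this (a sketch only: runMeanShift is a hypothetical helper that runs the
> > job with the given parameters and returns the number of clusters at
> > convergence):
> >
> > double t2 = t1;                                 // start from some upper bound
> > int previousClusters = runMeanShift(t1, t2, t2 / 2);
> > while (true) {
> >   t2 /= 2;                                      // decrease T2
> >   int clusters = runMeanShift(t1, t2, t2 / 2);  // bind delta = T2 / 2
> >   if (clusters == previousClusters) {
> >     break;                                      // count stabilized: t2 is a good choice
> >   }
> >   previousClusters = clusters;
> > }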
> >
> > If you are interested I can send the code.
> >
> > Regards, Vasil
> >
> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <goksron@gmail.com> wrote:
> >
> >> Vasil-
> >>
> >> Would you consider adding your estimation algorithm to this patch?
> >> https://issues.apache.org/jira/browse/MAHOUT-563
> >>
> >> The estimator in there now is stupid; a real one would make the Canopy
> >> algorithms orders of magnitude more useful.
> >>
> >> Lance
> >>
> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <vavasilev@gmail.com> wrote:
> >> >
> >> >>
> >> >> dimension 1: Using linear regression with a gradient descent algorithm
> >> >> I find the trend of the line, i.e. whether it is increasing, decreasing
> >> >> or a straight line
> >> >> dimension 2: Knowing the approximating line (from the linear regression)
> >> >> I count how many times this line gets crossed by the original signal.
> >> >> This helps in separating the cyclic data from all the rest
> >> >> dimension 3: What is the biggest increase/decrease in a single signal
> >> >> line. This helps find shifts
> >> >>
> >> >> So to say, I impose semantics on the data that are to be clustered
> >> >> (I don't know if it is correct to do that, but I couldn't think of how
> >> >> an algorithm could cope with the task without such additional semantics)
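> >> >>
> >> >> For illustration, the three features could be computed roughly like this
> >> >> (a sketch only: a closed-form least-squares fit stands in for the
> >> >> gradient descent step, and the helper name is mine):
> >> >>
> >> >> // Sketch: extract the three clustering features from one signal.
> >> >> static double[] extractFeatures(double[] signal) {
> >> >>   int n = signal.length;
> >> >>   // dimension 1: slope of the least-squares line y = a + b*x (the trend)
> >> >>   double sx = 0, sy = 0, sxx = 0, sxy = 0;
> >> >>   for (int x = 0; x < n; x++) {
> >> >>     sx += x; sy += signal[x]; sxx += (double) x * x; sxy += x * signal[x];
> >> >>   }
> >> >>   double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
> >> >>   double a = (sy - b * sx) / n;
> >> >>   // dimension 2: how many times the signal crosses the fitted line
> >> >>   int crossings = 0;
> >> >>   for (int x = 1; x < n; x++) {
> >> >>     double prev = signal[x - 1] - (a + b * (x - 1));
> >> >>     double cur = signal[x] - (a + b * x);
> >> >>     if (prev * cur < 0) crossings++;
> >> >>   }
> >> >>   // dimension 3: biggest single-step increase/decrease
> >> >>   double maxJump = 0;
> >> >>   for (int x = 1; x < n; x++) {
> >> >>     maxJump = Math.max(maxJump, Math.abs(signal[x] - signal[x - 1]));
> >> >>   }
> >> >>   return new double[] { b, crossings, maxJump };
> >> >> }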
> >> >>
> >> >
> >> > It is very common for feature extraction like this to be the key for
> >> > data-mining projects. Such features are absolutely critical for most
> >> > time series mining and are highly application dependent.
> >> >
> >> > One key aspect of your features is that they are shift invariant.
> >> >
> >> >
> >> >> Also, I developed a small Swing application which visualizes the
> >> >> clustered signals and which helped me in playing with the algorithms.
> >> >>
> >> >
> >> > Great idea.
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
