spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hector Yee <hector....@gmail.com>
Subject Re: Contributing to MLlib: Proposal for Clustering Algorithms
Date Tue, 08 Jul 2014 21:44:00 GMT
Yeah if one were to replace the objective function in decision tree with
minimizing the variance of the leaf nodes it would be a hierarchical
clusterer.


On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.sparks@gmail.com>
wrote:

> If you're thinking along these lines, have a look at the DecisionTree
> implementation in MLlib. It uses the same idea and is optimized to prevent
> multiple passes over the data by computing several splits at each level of
> tree building. The tradeoff is increased model state and computation per
> pass over the data, but fewer total passes and hopefully lower
> communication overheads than, say, shuffling data around that belongs to
> one cluster or another. Something like that could work here as well.
>
> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
> efficient way to implement it, though.
>
>
> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector.yee@gmail.com> wrote:
>
> > No was thinking more top-down:
> >
> > assuming a distributed kmeans system already existing, recursively apply
> > the kmeans algorithm on data already partitioned by the previous level of
> > kmeans.
> >
> > I haven't been much of a fan of bottom up approaches like HAC mainly
> > because they assume there is already a distance metric for items to
> items.
> > This makes it hard to cluster new content. The distances between sibling
> > clusters is also hard to compute (if you have thrown away the similarity
> > matrix), do you count paths to same parent node if you are computing
> > distances between items in two adjacent nodes for example. It is also a
> bit
> > harder to distribute the computation for bottom up approaches as you have
> > to already find the nearest neighbor to an item to begin the process.
> >
> >
> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowling@gmail.com> wrote:
> >
> > > The scikit-learn implementation may be of interest:
> > >
> > >
> > >
> >
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
> > >
> > > It's a bottom up approach.  The pair of clusters for merging are
> > > chosen to minimize variance.
> > >
> > > Their code is under a BSD license so it can be used as a template.
> > >
> > > Is something like that you were thinking Hector?
> > >
> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > > > sure. more interesting problem here is choosing k at each level.
> Kernel
> > > > methods seem to be most promising.
> > > >
> > > >
> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector.yee@gmail.com>
> > wrote:
> > > >
> > > >> No idea, never looked it up. Always just implemented it as doing
> > k-means
> > > >> again on each cluster.
> > > >>
> > > >> FWIW standard k-means with euclidean distance has problems too with
> > some
> > > >> dimensionality reduction methods. Swapping out the distance metric
> > with
> > > >> negative dot or cosine may help.
> > > >>
> > > >> Other more useful clustering would be hierarchical SVD. The reason
> > why I
> > > >> like hierarchical clustering is it makes for faster inference
> > especially
> > > >> over billions of users.
> > > >>
> > > >>
> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > > >> wrote:
> > > >>
> > > >> > Hector, could you share the references for hierarchical K-means?
> > > thanks.
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector.yee@gmail.com>
> > > wrote:
> > > >> >
> > > >> > > I would say for bigdata applications the most useful would
be
> > > >> > hierarchical
> > > >> > > k-means with back tracking and the ability to support k
nearest
> > > >> > centroids.
> > > >> > >
> > > >> > >
> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com
> >
> > > >> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > MLlib currently has one clustering algorithm implementation,
> > > KMeans.
> > > >> > > > It would benefit from having implementations of other
> clustering
> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
> Hierarchical
> > > >> > > > Clustering, and Affinity Propagation.
> > > >> > > >
> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
> > > implementation,
> > > >> > > > and I saw an email on this list about interest in implementing
> > > Fuzzy
> > > >> > > > C-Means.
> > > >> > > >
> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans
code, it
> > became
> > > >> > > > apparent that before I implement more clustering algorithms,
> it
> > > would
> > > >> > > > be useful to hammer out a framework to reduce code
duplication
> > and
> > > >> > > > implement a consistent API.
> > > >> > > >
> > > >> > > > I'd like to gauge the interest and goals of the MLlib
> community:
> > > >> > > >
> > > >> > > > 1. Are you interested in having more clustering algorithms
> > > available?
> > > >> > > >
> > > >> > > > 2. Is the community interested in specifying a common
> framework?
> > > >> > > >
> > > >> > > > Thanks!
> > > >> > > > RJ
> > > >> > > >
> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > em rnowling@gmail.com
> > > >> > > > c 954.496.2314
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
> > > >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Yee Yang Li Hector <http://google.com/+HectorYee>
> > > >> *google.com/+HectorYee <http://google.com/+HectorYee>*
> > > >>
> > >
> > >
> > >
> > > --
> > > em rnowling@gmail.com
> > > c 954.496.2314
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector <http://google.com/+HectorYee>
> > *google.com/+HectorYee <http://google.com/+HectorYee>*
> >
>



-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message