spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hector Yee <hector....@gmail.com>
Subject Re: Contributing to MLlib: Proposal for Clustering Algorithms
Date Tue, 08 Jul 2014 20:31:53 GMT
No idea, never looked it up. Always just implemented it as doing k-means
again on each cluster.

FWIW standard k-means with euclidean distance has problems too with some
dimensionality reduction methods. Swapping out the distance metric with
negative dot or cosine may help.

Other more useful clustering would be hierarchical SVD. The reason why I
like hierarchical clustering is it makes for faster inference especially
over billions of users.


On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> Hector, could you share the references for hierarchical K-means? thanks.
>
>
> On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector.yee@gmail.com> wrote:
>
> > I would say for bigdata applications the most useful would be
> hierarchical
> > k-means with back tracking and the ability to support k nearest
> centroids.
> >
> >
> > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > It would benefit from having implementations of other clustering
> > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > Clustering, and Affinity Propagation.
> > >
> > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > and I saw an email on this list about interest in implementing Fuzzy
> > > C-Means.
> > >
> > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > apparent that before I implement more clustering algorithms, it would
> > > be useful to hammer out a framework to reduce code duplication and
> > > implement a consistent API.
> > >
> > > I'd like to gauge the interest and goals of the MLlib community:
> > >
> > > 1. Are you interested in having more clustering algorithms available?
> > >
> > > 2. Is the community interested in specifying a common framework?
> > >
> > > Thanks!
> > > RJ
> > >
> > > [1] - https://github.com/apache/spark/pull/1248
> > >
> > >
> > > --
> > > em rnowling@gmail.com
> > > c 954.496.2314
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector <http://google.com/+HectorYee>
> > *google.com/+HectorYee <http://google.com/+HectorYee>*
> >
>



-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message