spark-dev mailing list archives

From Hector Yee <hector....@gmail.com>
Subject Re: Contributing to MLlib: Proposal for Clustering Algorithms
Date Tue, 08 Jul 2014 21:00:40 GMT
K doesn't matter much: I've tried anything from 2^10 to 10^3, and the
performance doesn't change much as measured by precision @ K (see Table 1
in http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although
10^3 k-means did slightly outperform 2^10 hierarchical SVD on those
metrics, the 2^10 SVD was much faster in terms of inference time.

I found that the thing that affected performance most was adding
backtracking to fix mistakes made at higher levels, rather than how K is
picked per level.
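
To make the backtracking idea concrete, here is a rough stand-alone sketch
(plain Python, not MLlib code; all names such as `build_tree` and `nearest`
are illustrative): a hierarchical k-means tree is queried with a small beam
instead of a purely greedy descent, so a wrong turn near the root can be
corrected at a lower level.

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two equal-length tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=20):
    # plain Lloyd's algorithm; returns (centroids, clusters)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[best].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def build_tree(points, k=2, min_size=4):
    # hierarchical k-means: run k-means again on each cluster
    if len(points) <= max(min_size, k):
        return {"points": points}
    centroids, clusters = kmeans(points, k)
    if any(len(cl) == len(points) for cl in clusters):
        return {"points": points}  # degenerate split; stop here
    return {"children": [(c, build_tree(cl, k, min_size))
                         for c, cl in zip(centroids, clusters)]}

def nearest(tree, q, beam=2):
    # beam search: instead of committing to the single closest centroid
    # at each level (greedy descent), keep the `beam` closest branches,
    # which lets the search recover from a mistake made near the root
    frontier, leaves = [tree], []
    while frontier:
        expand = []
        for node in frontier:
            if "points" in node:
                leaves.append(node)
            else:
                expand.extend(node["children"])
        expand.sort(key=lambda cc: dist2(q, cc[0]))
        frontier = [child for _, child in expand[:beam]]
    cands = [p for leaf in leaves for p in leaf["points"]]
    return min(cands, key=lambda p: dist2(q, p))
```

With `beam=1` this is the usual greedy descent; a slightly larger beam
trades a little inference time for fewer mistakes, and a beam wider than
the tree degenerates to exhaustive search.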



On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> Sure. The more interesting problem here is choosing k at each level.
> Kernel methods seem to be the most promising.
>
>
> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector.yee@gmail.com> wrote:
>
> > No idea, I never looked it up. I always just implemented it by running
> > k-means again on each cluster.
> >
> > FWIW, standard k-means with Euclidean distance also has problems with
> > some dimensionality reduction methods. Swapping the distance metric
> > for negative dot product or cosine may help.
> >
> > Another useful clustering method would be hierarchical SVD. The reason
> > I like hierarchical clustering is that it makes for faster inference,
> > especially over billions of users.
> >
> >
> > On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> > > Hector, could you share the references for hierarchical K-means?
> > > Thanks.
> > >
> > >
> > > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector.yee@gmail.com> wrote:
> > >
> > > > I would say for big-data applications the most useful would be
> > > > hierarchical k-means with backtracking and the ability to support
> > > > k nearest centroids.
> > > >
> > > >
> > > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > MLlib currently has one clustering algorithm implementation,
> > > > > KMeans. It would benefit from having implementations of other
> > > > > clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means,
> > > > > Hierarchical Clustering, and Affinity Propagation.
> > > > >
> > > > > I recently submitted a PR [1] for a MiniBatch KMeans
> > > > > implementation, and I saw an email on this list about interest
> > > > > in implementing Fuzzy C-Means.
> > > > >
> > > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
> > > > > became apparent that before I implement more clustering
> > > > > algorithms, it would be useful to hammer out a framework to
> > > > > reduce code duplication and implement a consistent API.
> > > > >
> > > > > I'd like to gauge the interest and goals of the MLlib community:
> > > > >
> > > > > 1. Are you interested in having more clustering algorithms
> > > > > available?
> > > > >
> > > > > 2. Is the community interested in specifying a common framework?
> > > > >
> > > > > Thanks!
> > > > > RJ
> > > > >
> > > > > [1] - https://github.com/apache/spark/pull/1248
> > > > >
> > > > >
> > > > > --
> > > > > em rnowling@gmail.com
> > > > > c 954.496.2314
> > > > >
> > > >
> > > >
> > >
> >
> >
> >
>
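
To make the metric swap mentioned above concrete, here is a rough
stand-alone sketch (plain Python, not MLlib code; the name
`spherical_kmeans` and its helpers are mine): k-means with the Euclidean
metric replaced by cosine similarity, renormalizing centroids so they stay
on the unit sphere.

```python
import math
import random

def cosine(a, b):
    # cosine similarity; defined as 0.0 if either vector is zero
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def spherical_kmeans(points, k, iters=20):
    # Lloyd's algorithm with the distance metric swapped out:
    # assign each point to the MOST similar centroid (max cosine),
    # then renormalize the cluster mean back onto the unit sphere
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda i: cosine(p, centroids[i]))
            clusters[best].append(p)
        updated = []
        for i, cl in enumerate(clusters):
            if not cl:
                updated.append(centroids[i])  # keep centroid of empty cluster
                continue
            m = [sum(dim) / len(cl) for dim in zip(*cl)]
            norm = math.sqrt(sum(x * x for x in m)) or 1.0
            updated.append(tuple(x / norm for x in m))
        centroids = updated
    return centroids
```

Note that for unit-length vectors, ||a - b||^2 = 2 - 2 cos(a, b), so simply
normalizing the input vectors makes a stock Euclidean k-means (such as
MLlib's KMeans) rank centroids the same way cosine similarity would.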



-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*
