spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From RJ Nowling <rnowl...@gmail.com>
Subject Re: Contributing to MLlib: Proposal for Clustering Algorithms
Date Wed, 09 Jul 2014 12:15:20 GMT
Thanks everyone for the input.

So it seems what people want is:

* Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
conquer approach, look at DecisionTree implementation as a reference)
* Restructure 3 Kmeans clustering algorithm implementations to prevent
code duplication and conform to a consistent API where possible

If this is correct, I'll start work on that.  How would it be best to
structure it? Should I submit separate JIRAs / PRs for refactoring of
current code, MiniBatch KMeans, and Hierarchical or keep my current
JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
for Hierarchical KMeans that builds on top of that?

Thanks!


On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <hector.yee@gmail.com> wrote:
> Yeah if one were to replace the objective function in decision tree with
> minimizing the variance of the leaf nodes it would be a hierarchical
> clusterer.
>
>
> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.sparks@gmail.com>
> wrote:
>
>> If you're thinking along these lines, have a look at the DecisionTree
>> implementation in MLlib. It uses the same idea and is optimized to prevent
>> multiple passes over the data by computing several splits at each level of
>> tree building. The tradeoff is increased model state and computation per
>> pass over the data, but fewer total passes and hopefully lower
>> communication overheads than, say, shuffling data around that belongs to
>> one cluster or another. Something like that could work here as well.
>>
>> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
>> efficient way to implement it, though.
>>
>>
>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector.yee@gmail.com> wrote:
>>
>> > No was thinking more top-down:
>> >
>> > assuming a distributed kmeans system already existing, recursively apply
>> > the kmeans algorithm on data already partitioned by the previous level of
>> > kmeans.
>> >
>> > I haven't been much of a fan of bottom up approaches like HAC mainly
>> > because they assume there is already a distance metric for items to
>> items.
>> > This makes it hard to cluster new content. The distances between sibling
>> > clusters is also hard to compute (if you have thrown away the similarity
>> > matrix), do you count paths to same parent node if you are computing
>> > distances between items in two adjacent nodes for example. It is also a
>> bit
>> > harder to distribute the computation for bottom up approaches as you have
>> > to already find the nearest neighbor to an item to begin the process.
>> >
>> >
>> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowling@gmail.com> wrote:
>> >
>> > > The scikit-learn implementation may be of interest:
>> > >
>> > >
>> > >
>> >
>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>> > >
>> > > It's a bottom up approach.  The pair of clusters for merging are
>> > > chosen to minimize variance.
>> > >
>> > > Their code is under a BSD license so it can be used as a template.
>> > >
>> > > Is something like that you were thinking Hector?
>> > >
>> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > wrote:
>> > > > sure. more interesting problem here is choosing k at each level.
>> Kernel
>> > > > methods seem to be most promising.
>> > > >
>> > > >
>> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector.yee@gmail.com>
>> > wrote:
>> > > >
>> > > >> No idea, never looked it up. Always just implemented it as doing
>> > k-means
>> > > >> again on each cluster.
>> > > >>
>> > > >> FWIW standard k-means with euclidean distance has problems too
with
>> > some
>> > > >> dimensionality reduction methods. Swapping out the distance metric
>> > with
>> > > >> negative dot or cosine may help.
>> > > >>
>> > > >> Other more useful clustering would be hierarchical SVD. The reason
>> > why I
>> > > >> like hierarchical clustering is it makes for faster inference
>> > especially
>> > > >> over billions of users.
>> > > >>
>> > > >>
>> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >
>> > > >> wrote:
>> > > >>
>> > > >> > Hector, could you share the references for hierarchical K-means?
>> > > thanks.
>> > > >> >
>> > > >> >
>> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector.yee@gmail.com>
>> > > wrote:
>> > > >> >
>> > > >> > > I would say for bigdata applications the most useful
would be
>> > > >> > hierarchical
>> > > >> > > k-means with back tracking and the ability to support
k nearest
>> > > >> > centroids.
>> > > >> > >
>> > > >> > >
>> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com
>> >
>> > > >> wrote:
>> > > >> > >
>> > > >> > > > Hi all,
>> > > >> > > >
>> > > >> > > > MLlib currently has one clustering algorithm implementation,
>> > > KMeans.
>> > > >> > > > It would benefit from having implementations of
other
>> clustering
>> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
>> Hierarchical
>> > > >> > > > Clustering, and Affinity Propagation.
>> > > >> > > >
>> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
>> > > implementation,
>> > > >> > > > and I saw an email on this list about interest
in implementing
>> > > Fuzzy
>> > > >> > > > C-Means.
>> > > >> > > >
>> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans
code, it
>> > became
>> > > >> > > > apparent that before I implement more clustering
algorithms,
>> it
>> > > would
>> > > >> > > > be useful to hammer out a framework to reduce code
duplication
>> > and
>> > > >> > > > implement a consistent API.
>> > > >> > > >
>> > > >> > > > I'd like to gauge the interest and goals of the
MLlib
>> community:
>> > > >> > > >
>> > > >> > > > 1. Are you interested in having more clustering
algorithms
>> > > available?
>> > > >> > > >
>> > > >> > > > 2. Is the community interested in specifying a
common
>> framework?
>> > > >> > > >
>> > > >> > > > Thanks!
>> > > >> > > > RJ
>> > > >> > > >
>> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > --
>> > > >> > > > em rnowling@gmail.com
>> > > >> > > > c 954.496.2314
>> > > >> > > >
>> > > >> > >
>> > > >> > >
>> > > >> > >
>> > > >> > > --
>> > > >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
>> > > >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Yee Yang Li Hector <http://google.com/+HectorYee>
>> > > >> *google.com/+HectorYee <http://google.com/+HectorYee>*
>> > > >>
>> > >
>> > >
>> > >
>> > > --
>> > > em rnowling@gmail.com
>> > > c 954.496.2314
>> > >
>> >
>> >
>> >
>> > --
>> > Yee Yang Li Hector <http://google.com/+HectorYee>
>> > *google.com/+HectorYee <http://google.com/+HectorYee>*
>> >
>>
>
>
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
> *google.com/+HectorYee <http://google.com/+HectorYee>*



-- 
em rnowling@gmail.com
c 954.496.2314

Mime
View raw message