spark-dev mailing list archives

From RJ Nowling
Subject Re: Contributing to MLlib: Proposal for Clustering Algorithms
Date Fri, 18 Jul 2014 12:05:55 GMT
Nice to meet you, Jeremy!

This is great!  Hierarchical clustering was next on my list --
currently trying to get my PR for MiniBatch KMeans accepted.

If it's cool with you, I'll try converting your code to fit in with
the existing MLlib code as you suggest. I also need to review the
Decision Tree code (as suggested above) to see how much of that can be

Maybe I can ask you to do a code review for me when I'm done?

On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman wrote:
> Hi all,
> Cool discussion! I agree that a more standardized API for clustering, and
> easy access to underlying routines, would be useful (we've also been
> discussing this when trying to develop streaming clustering algorithms,
> similar to
> For divisive, hierarchical clustering I implemented something a while back;
> here's a gist.
> It does bisecting k-means clustering (with k=2), with a recursive class for
> keeping track of the tree. I also found this much better than agglomerative
> methods (for the reasons Hector points out).
> This needs to be cleaned up, and can surely be optimized (esp. by replacing
> the core KMeans step with existing MLlib code), but I can say I was running
> it successfully on quite large data sets.
> RJ, depending on where you are in your progress, I'd be happy to help work
> on this piece and / or have you use this as a jumping off point, if useful.
> -- Jeremy
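
[Editor's sketch] The gist mentioned above is not reproduced in this archive. As a rough illustration of the approach Jeremy describes (repeatedly bisecting with k-means at k=2, and tracking the result in a recursive tree class), here is a minimal single-machine Python sketch. It is not his code and not MLlib's API: the names two_means and ClusterNode, the max_depth and min_size parameters, and the fixed random seed are all assumptions made for this example.

```python
import random


def _mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def _sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=20, seed=0):
    """Lloyd's k-means with k=2 (hypothetical helper); returns two lists."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)  # two distinct positions as initial centers
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for p in points:
            closer_to_first = _sqdist(p, centers[0]) <= _sqdist(p, centers[1])
            (left if closer_to_first else right).append(p)
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        centers = [_mean(left), _mean(right)]
    return left, right


class ClusterNode:
    """Recursive tree node (hypothetical): one cluster and up to two children."""

    def __init__(self, points, depth=0):
        self.points = points
        self.center = _mean(points)
        self.depth = depth
        self.children = []

    def split(self, max_depth, min_size=2):
        """Bisect this node, then recurse into both halves (divisive step)."""
        if self.depth >= max_depth or len(self.points) < max(min_size, 2):
            return
        left, right = two_means(self.points)
        if not left or not right:
            return  # could not separate the points; keep this node as a leaf
        self.children = [
            ClusterNode(left, self.depth + 1),
            ClusterNode(right, self.depth + 1),
        ]
        for child in self.children:
            child.split(max_depth, min_size)

    def leaves(self):
        """Return the leaf nodes, i.e. the final flat clustering."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]
```

To try it, build a root with ClusterNode(points), call root.split(max_depth=...), and read the clusters from root.leaves(). In a Spark setting the two_means step is the part that would presumably be swapped for the existing distributed KMeans, as suggested above; this sketch keeps everything in local memory.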

