spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Derrick Burns (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations
Date Wed, 01 Oct 2014 01:18:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154171#comment-14154171
] 

Derrick Burns commented on SPARK-3504:
--------------------------------------

Works for me!

On Mon, Sep 29, 2014 at 7:35 PM, Patrick Wendell (JIRA) <jira@apache.org>



> KMeans optimization: track distances and unmoved cluster centers across iterations
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3504
>                 URL: https://issues.apache.org/jira/browse/SPARK-3504
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>
> The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because recomputes
all distances to all cluster centers on each iteration.  In later iterations of Lloyd's algorithm,
points don't change clusters and clusters don't move.  
> By 1) tracking which clusters move and 2) tracking for each point which cluster it belongs
to and the distance to that cluster, one can avoid recomputing distances in many cases with
very little increase in memory requirements. 
> I implemented this new algorithm and the results were fantastic. Using 16 c3.8xlarge
machines on EC2,  the clusterer converged in 13 iterations on 1,714,654 (182 dimensional)
points and 20,000 clusters in 24 minutes. Here are the running times for the first 7 rounds:
> 6 minutes and 42 second
> 7 minutes and 7 seconds
> 7 minutes 13 seconds
> 1 minutes 18 seconds
> 30 seconds
> 18 seconds
> 12 seconds
> Without this improvement, all rounds would have taken roughly 7 minutes, resulting in
Lloyd's iterations taking  7 * 13 = 91 minutes. In other words, this improvement resulting
in a reduction of roughly 75% in running time with no loss of accuracy.
> My implementation is a rewrite of the existing 1.0.2 implementation.  It is not a simple
modification of the existing implementation.  Please let me know if you are interested in
this new implementation.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message