spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Updated] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans
Date Wed, 18 May 2016 09:49:12 GMT


Sean Owen updated SPARK-15346:
       Assignee: Abraham Zhan
       Priority: Minor  (was: Major)
    Component/s:     (was: MLlib)

> Reduce duplicate computation in picking initial points in LocalKMeans
> ---------------------------------------------------------------------
>                 Key: SPARK-15346
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>         Environment: Ubuntu 14.04
>            Reporter: Abraham Zhan
>            Assignee: Abraham Zhan
>            Priority: Minor
>              Labels: performance
> h2.Main Issue
> I found that for KMeans|| in MLlib, when the dataset is large, the program hangs for a
long time without finishing after the initial KMeans|| step completes and before Lloyd's
iterations begin. After testing, I found that it is stuck in LocalKMeans. There is room for
improvement in LocalKMeans.scala in MLlib: after picking each new initial center, it is
unnecessary to recompute the distances between all the points and the old centers, as below
> {code:scala}
> val costArray = points.map { point =>
>       KMeans.fastSquaredDistance(point, centers(0))
>     }
> {code}
> Instead, we can keep the distance between each point and its closest center, compare it
with the distance to the new center, and update it when the new center is closer.
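The cached-distance idea described above can be sketched as follows. This is a minimal, standalone illustration and not the code from the attachment: the object name `IncrementalSeeding`, the helper `squaredDistance` (a plain stand-in for `KMeans.fastSquaredDistance`), and the unweighted points are all assumptions made to keep the example free of Spark dependencies.

```scala
import scala.util.Random

// Hypothetical standalone sketch; names are illustrative, not Spark's API.
object IncrementalSeeding {

  // Plain squared Euclidean distance (stand-in for KMeans.fastSquaredDistance).
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Pick an index with probability proportional to its cost.
  def pickWeighted(rand: Random, costs: Array[Double]): Int = {
    val r = rand.nextDouble() * costs.sum
    var cum = 0.0
    var i = 0
    while (i < costs.length - 1 && cum + costs(i) <= r) {
      cum += costs(i)
      i += 1
    }
    i
  }

  // k-means++ style seeding that keeps each point's cost to its closest center.
  def seed(points: Array[Array[Double]], k: Int, rand: Random): Array[Array[Double]] = {
    require(k >= 1 && points.nonEmpty, "need at least one point and one center")
    val centers = new Array[Array[Double]](k)
    centers(0) = points(rand.nextInt(points.length))
    // Computed once against the first center, then updated in place.
    val costArray = points.map(p => squaredDistance(p, centers(0)))
    for (i <- 1 until k) {
      centers(i) = points(pickWeighted(rand, costArray))
      // Only the new center is compared; distances to old centers are not recomputed.
      var j = 0
      while (j < points.length) {
        costArray(j) = math.min(costArray(j), squaredDistance(points(j), centers(i)))
        j += 1
      }
    }
    centers
  }
}
```

With this layout, choosing center i costs one distance computation per point instead of i of them, so seeding k centers over n points needs on the order of n*k distance evaluations rather than n*k^2.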
> h2.Test
> Download [|]
> I provided an attachment "" which contains the code "LocalKMeans2.scala" and the
dataset "bigKMeansMedia".
> LocalKMeans2.scala contains both the original method KMeansPlusPlus and a modified
version KMeansPlusPlusModify (best fit with spark.mllib-1.6.0).
> I added tests and a main function to it so that anyone can run the file directly.
> h3.How to Test
> Replace mllib.clustering.LocalKMeans.scala in your local repository with my LocalKMeans2.scala,
or just put them in the same directory.
> Change the path in line 34 (loadAndRun()) to the path where you stored the data file bigKMeansMedia,
which is also provided in the patch.
> Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the number of
clusters K and the number of iterations respectively.
> The console will then print the elapsed time and SE of the two versions of KMeans++ respectively.
> h2.Test Results
> This data was generated from a KMeans|| experiment in Spark. I added some inner functions
to output and store the result of the KMeans|| initialization.
> The first line of the file with format "%d:%d:%d:%d" indicates "the seed:feature num:iteration
num (in original KMeans||):points num" of the data. 
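As an illustration of the header format described above, the first line could be parsed along these lines. This is only a sketch: the case class `DataHeader`, its field names, and the choice of `Long` for the seed are assumptions, and the loader in LocalKMeans2.scala may read the file differently.

```scala
import scala.util.Try

// Hypothetical container for the four header fields; names are illustrative only.
final case class DataHeader(seed: Long, numFeatures: Int, numIterations: Int, numPoints: Int)

object DataHeader {
  // Parses "seed:featureNum:iterationNum:pointsNum"; returns None on malformed input.
  def parse(line: String): Option[DataHeader] =
    line.trim.split(":") match {
      case Array(s, f, it, n) =>
        Try(DataHeader(s.toLong, f.toInt, it.toInt, n.toInt)).toOption
      case _ => None
    }
}
```

Returning an `Option` keeps a malformed first line from throwing before any clustering runs; the caller decides how to report it.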
> On my machine the experiment results are as below:
> !!
> The x-axis is the number of clusters k, and the y-axis is the time in seconds.

This message was sent by Atlassian JIRA
