spark-issues mailing list archives

From "Abraham Zhan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large
Date Mon, 16 May 2016 09:39:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284294#comment-15284294
] 

Abraham Zhan edited comment on SPARK-6706 at 5/16/16 9:38 AM:
--------------------------------------------------------------

h2.Main Issue
I found the actual reason why the GUI does not finish: it is stuck in LocalKMeans.
There is also room for improvement in LocalKMeans.scala in MLlib. After picking
each new initial center, it is unnecessary to recompute the distances between all the points
and the old centers, as below
{code:scala}
val costArray = points.map { point =>
  KMeans.fastSquaredDistance(point, centers(0))
}
{code}

Instead, we can keep the distance between each point and its closest center, compare it
with the distance to the new center, and update it only when the new center is closer.
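
The idea above can be sketched in plain Scala without any Spark dependency. This is only an illustration under assumptions: {{squaredDistance}} stands in for {{KMeans.fastSquaredDistance}}, and the names {{IncrementalCostSketch}}, {{updateCosts}} and {{pickCenters}} are hypothetical; they are not taken from LocalKMeans.scala or from the attached patch.

```scala
import scala.util.Random

object IncrementalCostSketch {
  // Plain squared Euclidean distance (stand-in for KMeans.fastSquaredDistance).
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Lowers costs(j) in place wherever newCenter is closer to points(j)
  // than the closest center seen so far. One distance per point.
  def updateCosts(points: Array[Array[Double]], costs: Array[Double],
      newCenter: Array[Double]): Unit = {
    var j = 0
    while (j < points.length) {
      val d = squaredDistance(points(j), newCenter)
      if (d < costs(j)) costs(j) = d
      j += 1
    }
  }

  // k-means++ style seeding that reuses the cached costs (hypothetical helper).
  def pickCenters(points: Array[Array[Double]], k: Int, seed: Long): Array[Array[Double]] = {
    require(points.nonEmpty && k >= 1, "need at least one point and k >= 1")
    val rand = new Random(seed)
    val centers = new Array[Array[Double]](k)
    centers(0) = points(rand.nextInt(points.length))
    // Cost of each point with respect to the first center only.
    val costs = points.map(p => squaredDistance(p, centers(0)))
    for (i <- 1 until k) {
      // Sample an index with probability proportional to its cached cost.
      val r = rand.nextDouble() * costs.sum
      var j = 0
      var cumulative = costs(0)
      while (j < points.length - 1 && cumulative < r) {
        j += 1
        cumulative += costs(j)
      }
      centers(i) = points(j)
      // Only the new center can change a point's closest distance.
      updateCosts(points, costs, centers(i))
    }
    centers
  }
}
```

With n points of dimension d, each new center then costs one pass of n distance computations, rather than one pass per center already chosen.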

h2.Test
Download [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
The attached archive "LocalKMeans.zip" contains the code "LocalKMeans.scala" and the dataset
"bigKMeansMedia".
LocalKMeans.scala contains both the original method KMeansPlusPlus and a modified version,
KMeansPlusPlusModify (best fit with spark.mllib-1.6.0).
I added tests and a main function to it so that anyone can run the file directly.

h3.How to Test
Replace mllib.clustering.LocalKMeans.scala in your local repository with my LocalKMeans.scala.

Change the path in line 34 (loadAndRun()) to the path where you stored the data file bigKMeansMedia,
which is also provided in the patch.
Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the number of
clusters K and the number of iterations, respectively.
The console will then print the elapsed time and the SE of each of the two versions of KMeans++.
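
For readers who want to reproduce this kind of comparison without the attachment, a minimal measurement sketch follows. It assumes that "SE" means the sum of squared distances from each point to its closest center; {{SeedingBenchmarkSketch}}, {{timed}} and {{squaredError}} are hypothetical names and are not the helpers used in the attached LocalKMeans.scala.

```scala
object SeedingBenchmarkSketch {
  // Runs body once and returns its result with the elapsed wall-clock seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  // Sum over all points of the squared distance to the closest center
  // (assumed meaning of "SE").
  def squaredError(points: Array[Array[Double]], centers: Array[Array[Double]]): Double = {
    require(centers.nonEmpty, "need at least one center")
    points.map { p =>
      centers.map { c =>
        p.zip(c).map { case (x, y) => (x - y) * (x - y) }.sum
      }.min
    }.sum
  }

  // Example use, where seeding is any function that returns the chosen centers:
  //   val (centers, seconds) = timed(seeding(points, k))
  //   println(f"time: $seconds%.3f s, SE: ${squaredError(points, centers)}%.3f")
}
```

A single run is noisy on the JVM, so repeating each measurement a few times and discarding the first runs gives a fairer comparison.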

h2.Test Results

This data was generated from a KMeans|| experiment in Spark; I added some internal functions to
output the result of the KMeans|| initialization and save it.
The first line of the file, in the format "%d:%d:%d:%d", gives "the seed:feature num:iteration
num (in original KMeans||):points num" of the data.
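
Reading that first line can be sketched as follows. Only the order of the four fields comes from the description above; the case class, its field names and {{parseHeader}} are hypothetical, and the layout of the remaining lines of the file is not described here, so it is left out.

```scala
object DataHeaderSketch {
  // Field names are illustrative; only their order follows the description.
  final case class Header(seed: Long, numFeatures: Int, numIterations: Int, numPoints: Int)

  // Parses a header line of the form "%d:%d:%d:%d".
  def parseHeader(line: String): Header = {
    val parts = line.trim.split(":")
    require(parts.length == 4, s"expected 4 ':'-separated fields, got ${parts.length}")
    Header(parts(0).toLong, parts(1).toInt, parts(2).toInt, parts(3).toInt)
  }
}
```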

On my machine, the experiment results are as below:

!https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
 The x-axis is the number of clusters k and the y-axis is the time in seconds.


was (Author: mouendless):
h2.Main Issue
I found room for improvement in LocalKMeans.scala in MLlib. After picking each new initial
center, it is unnecessary to recompute the distances between all the points and the old centers,
as below
{code:scala}
val costArray = points.map { point =>
  KMeans.fastSquaredDistance(point, centers(0))
}
{code}

Instead, we can keep the distance between each point and its closest center, compare it
with the distance to the new center, and update it only when the new center is closer.


> kmeans|| hangs for a long time if both k and vector dimension are large
> -----------------------------------------------------------------------
>
>                 Key: SPARK-6706
>                 URL: https://issues.apache.org/jira/browse/SPARK-6706
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1, 1.3.0
>         Environment: Windows 64bit, Linux 64bit
>            Reporter: Xi Shen
>            Assignee: Xiangrui Meng
>              Labels: performance
>         Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm, which is the default one, the
algorithm finishes some {{collect()}} jobs, then the *driver* hangs for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1. I tested with both local and cluster mode.
On Spark 1.3.0, I can also reproduce this issue in local mode. **However, I do not have
a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

