spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: kmeans|| in Spark is not real paralleled?
Date Mon, 30 Mar 2015 21:18:35 GMT
This PR updated the k-means|| initialization:
https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
which was included in 1.3.0. It should fix k-means|| initialization with
large k. Please create a JIRA for this issue and send me the code and the
dataset to reproduce this problem. Thanks! -Xiangrui

On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen <davidshen84@gmail.com> wrote:

> Hi,
>
> I have opened a couple of threads asking about k-means performance problem
> in Spark. I think I made a little progress.
>
> Previously I used the simplest call, KMeans.train(rdd, k, maxIterations).
> It uses the "k-means||" initialization algorithm, which is supposed to be a
> faster version of k-means++ and to give better results in general.
>
> But I observed that if k is very large, the initialization step takes
> a long time. From the CPU utilization chart, it looks like only one thread
> is working. Please see
> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
> .
>
> I read the paper,
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it points
> out that the k-means++ initialization algorithm suffers when k is large.
> That is why the paper contributes the k-means|| algorithm.
>
>
> If I invoke KMeans.train with the random initialization algorithm, I
> do not observe this problem, even with a very large k, such as k=5000. This
> makes me suspect that k-means|| in Spark is not properly implemented and
> does not actually run in parallel.
>
>
> I have also tested my code and data set with Spark 1.3.0, and I still
> observe this problem. I quickly checked the PR regarding the KMeans
> algorithm change from 1.2.0 to 1.3.0. It seems to contain only code
> improvements and polish, not changes to the algorithm itself.
>
>
> I originally worked in a 64-bit Windows environment, and I also tested in a
> 64-bit Linux environment. I can provide the code and data set if anyone
> wants to reproduce this problem.
>
>
> I hope a Spark developer can comment on this problem and help
> identify whether it is a bug.
>
>
> Thanks,
>
> Xi Shen
>
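
For readers following the thread, the difference between the two seeding
schemes discussed above can be sketched in plain Python. This is a rough
illustration of the procedure described in the linked paper, not Spark's
implementation: the function names, the oversampling factor of 2k, and the
fixed 5 sampling rounds are assumptions chosen for the example. It counts
full passes over the data. k-means++ needs one pass per additional center,
while k-means|| needs a fixed number of passes regardless of k, followed by
a reclustering of the small candidate set on a single machine.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_sq_dist(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)


def weighted_choice(items, weights, rng):
    """Pick one item with probability proportional to its weight."""
    if sum(weights) <= 0:
        return rng.choice(items)  # degenerate case: fall back to uniform
    return rng.choices(items, weights=weights, k=1)[0]


def weighted_kmeans_pp(points, weights, k, rng):
    """k-means++ seeding (unit weights give the classic version).

    Returns (centers, passes); every new center costs one full pass.
    """
    centers = [weighted_choice(points, weights, rng)]
    passes = 0
    while len(centers) < k:
        scores = [w * nearest_sq_dist(p, centers)
                  for p, w in zip(points, weights)]
        passes += 1
        if sum(scores) <= 0:  # no distinct points left to choose
            break
        centers.append(weighted_choice(points, scores, rng))
    return centers, passes


def kmeans_parallel_init(points, k, rng, rounds=5):
    """k-means|| seeding sketch; rounds and the 2k oversampling are assumed."""
    oversample = 2 * k
    candidates = [rng.choice(points)]
    passes = 0
    for _ in range(rounds):
        # One pass; each point's distance is independent of the others,
        # so this is the step that can be spread over many workers.
        d2 = [nearest_sq_dist(p, candidates) for p in points]
        passes += 1
        phi = sum(d2)
        if phi <= 0:
            break
        candidates.extend(p for p, d in zip(points, d2)
                          if rng.random() < min(1.0, oversample * d / phi))
    # One more pass: weight each candidate by how many points it is closest to.
    weights = [0] * len(candidates)
    for p in points:
        nearest = min(range(len(candidates)),
                      key=lambda i: sq_dist(p, candidates[i]))
        weights[nearest] += 1
    passes += 1
    # Recluster the small candidate set down to k centers on one machine.
    centers, _ = weighted_kmeans_pp(candidates, weights, k, rng)
    return centers, passes
```

Called with unit weights, weighted_kmeans_pp makes k - 1 passes over the
data, whereas kmeans_parallel_init makes rounds + 1 passes however large k
is. The final reclustering step is still sequential in k, so with a very
large k it may be worth measuring how much of the initialization time is
spent there.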
