spark-issues mailing list archives

From "zhengruifeng (JIRA)" <>
Subject [jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
Date Tue, 21 Feb 2017 03:34:44 GMT


zhengruifeng commented on SPARK-14174:

[~srowen] The {{run}} changes were already merged, so can we continue working on this?

> Accelerate KMeans via Mini-Batch EM
> -----------------------------------
>                 Key: SPARK-14174
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce
> the computation time, while still attempting to optimise the same objective function. Mini-batches
> are subsets of the input data, randomly sampled in each training iteration. These mini-batches
> drastically reduce the amount of computation required to converge to a local solution. In
> contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means
> produces results that are generally only slightly worse than the standard algorithm.
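The per-iteration update described above can be sketched locally as follows. This is a hedged illustration of the standard mini-batch k-means update rule (sample a mini-batch, assign each sampled point to its nearest center, then move that center toward the point with a per-center learning rate of 1/count), not the proposed MLlib implementation; all names and parameters here are illustrative.

```scala
import scala.util.Random

// Illustrative local sketch of one mini-batch k-means iteration.
// Not the MLlib code: object/method names are made up for this example.
object MiniBatchSketch {
  type Vec = Array[Double]

  // Squared Euclidean distance between two vectors of equal length.
  def sqDist(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to point p.
  def nearest(centers: Array[Vec], p: Vec): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // One iteration: sample a mini-batch by Bernoulli sampling with the given
  // fraction, then move each assigned center toward its point using a
  // per-center learning rate of 1 / (times that center has been updated).
  def step(data: Array[Vec], centers: Array[Vec], counts: Array[Long],
           fraction: Double, rng: Random): Unit = {
    val batch = data.filter(_ => rng.nextDouble() < fraction)
    for (p <- batch) {
      val c = nearest(centers, p)
      counts(c) += 1
      val eta = 1.0 / counts(c)             // learning rate shrinks over time
      for (d <- centers(c).indices)         // convex step toward the point
        centers(c)(d) = (1 - eta) * centers(c)(d) + eta * p(d)
    }
  }
}
```

Because the learning rate decays as a center accumulates updates, the centers stabilize over iterations, which is why the result is typically only slightly worse than full-batch k-means.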
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs =
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||",
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123L)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123L)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
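For reference, `computeCost` above returns the within-set sum of squared errors (WSSSE): for every point, the squared Euclidean distance to its closest center, summed over all points. A minimal local sketch of that quantity, not the MLlib implementation (the function name and signature here are illustrative):

```scala
// WSSSE: sum over all points of the squared distance to the nearest center.
// Illustrative sketch only; not MLlib's computeCost.
def wssse(data: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
  data.map { p =>
    centers.map { c =>
      p.zip(c).map { case (x, y) => (x - y) * (x - y) }.sum // squared distance to one center
    }.min                                                   // keep the nearest center's distance
  }.sum
```

Comparing this quantity across the three trained models is what justifies the "almost the same" claim below.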
> All three of the above training runs reached the maximum number of iterations (10).
> We can see that the WSSSEs are almost the same, while their speed performance differs significantly:
> {code}
> KMeans                              2876 sec
> MiniBatch KMeans (fraction=0.1)      263 sec
> MiniBatch KMeans (fraction=0.01)      90 sec
> {code}
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The dataset used above has 8,100,000 samples and 784 features. It can be downloaded here (
> Comparison of K-Means and MiniBatchKMeans on sklearn:

This message was sent by Atlassian JIRA
