[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-14174:

Description:
MiniBatchKMeans is a variant of the KMeans algorithm that uses mini-batches to reduce computation time while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. They drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
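The per-iteration update can be sketched as follows. This is a minimal local sketch in the spirit of Sculley's web-scale mini-batch k-means, using plain arrays instead of RDDs; the names and signatures here are illustrative only and are not the MLlib API:

```scala
import scala.util.Random

// Mini-batch k-means sketch: each iteration draws random samples,
// assigns each sample to its nearest center, and moves that center
// toward the sample with a shrinking per-center learning rate 1/count.
object MiniBatchSketch {
  def squaredDist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  def train(points: Array[Array[Double]], k: Int, iterations: Int,
            miniBatchFraction: Double, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    // Initialise centers from random points (MLlib itself offers
    // smarter initialisation modes).
    val centers = rng.shuffle(points.toList).take(k).map(_.clone).toArray
    val counts = Array.fill(k)(0L)
    val batchSize = math.max(1, (points.length * miniBatchFraction).toInt)
    for (_ <- 0 until iterations; _ <- 0 until batchSize) {
      val p = points(rng.nextInt(points.length)) // random sample
      val j = centers.indices.minBy(c => squaredDist(centers(c), p))
      counts(j) += 1
      val eta = 1.0 / counts(j) // per-center step size shrinks over time
      var d = 0
      while (d < p.length) {
        centers(j)(d) += eta * (p(d) - centers(j)(d)); d += 1
      }
    }
    centers
  }
}
```

Because each iteration touches only `batchSize` samples instead of the whole dataset, the per-iteration cost drops by roughly the sampling fraction.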
I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
MiniBatch KMeans is named XMeans in the following lines.
{code}
val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()

val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="kmeans", seed=123L)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8

val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="kmeans", miniBatchFraction=0.1, seed=123L)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8

val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="kmeans", miniBatchFraction=0.01, seed=123L)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8
{code}
All three training runs above reached the maximum number of iterations (10).
The WSSSEs are almost the same, but the training times differ significantly:
{code}
KMeans                            2876 sec
MiniBatch KMeans (fraction=0.1)    263 sec
MiniBatch KMeans (fraction=0.01)    90 sec
{code}
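For reference, the cost compared above is the WSSSE (within-set sum of squared errors): the sum, over all points, of the squared Euclidean distance to the nearest center. A minimal local sketch on plain arrays (not the RDD-based `computeCost`):

```scala
// WSSSE: sum over all points of the squared Euclidean distance
// from each point to its nearest cluster center.
def wssse(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
  points.map { p =>
    centers.map { c =>
      c.zip(p).map { case (x, y) => (x - y) * (x - y) }.sum
    }.min
  }.sum
```

Because the mini-batch variant optimises this same objective, its final cost lands close to that of full-batch k-means, as the `res0`/`res1`/`res2` values above show.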
With an appropriate fraction, the larger the dataset, the greater the speedup.
The dataset used above has 8,100,000 samples with 784 features. It can be downloaded here: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2
> Accelerate KMeans via MiniBatch EM
> 
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>

This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, email: issues-unsubscribe@spark.apache.org
For additional commands, email: issues-help@spark.apache.org
