I have a large data set, and I expect to get about 5000 clusters.

I load the raw data, convert each record into a DenseVector, then repartition and cache the RDD, and finally pass the RDD[Vector] to KMeans.train().
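For reference, here is a minimal sketch of that pipeline. The input path, the CSV parsing, the partition count, and the iteration count are all placeholders for illustration, not my exact code:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input path and format: one comma-separated record per line.
val raw = sc.textFile("hdfs:///path/to/data.csv")

// Parse each line into a DenseVector.
val vectors = raw.map(line => Vectors.dense(line.split(',').map(_.toDouble)))

// Repartition to spread the data, then cache before the iterative training.
// 32 is a guess at 4 executors * 8 cores.
val data = vectors.repartition(32).cache()

// k = 5000 clusters; maxIterations = 20 is an arbitrary example value.
val model = KMeans.train(data, 5000, 20)
```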

The job is now running and the data have been loaded, but according to the Spark UI, all of the data ended up on a single executor. I checked that executor: its CPU load is very low, apparently using only 1 of its 8 cores, while the other 3 executors sit idle.

Did I miss something? How can I distribute the workload across all 4 executors?