I have posted more details about my problem at http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed

I would really appreciate it if you could take a look at this problem. I have tried various settings and ways to load and partition my data, but I just cannot get rid of that long pause.


On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshen84@gmail.com> wrote:
Yes, I have repartitioned the data.

I tried repartitioning to the number of cores in my cluster. That did not help.
I tried repartitioning to the number of centroids (the k value). That did not help either.

On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <joseph@databricks.com> wrote:
Can you try specifying the number of partitions when you load the data to equal the number of executors?  If your ETL changes the number of partitions, you can also repartition before calling KMeans.
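
The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the original job: the input path, the whitespace-separated line format, the partition count of 4 (one per executor, as described later in the thread), and the iteration count of 20 are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object KMeansPartitionSketch {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("KMeansPartitionSketch"))

    val inputPath = args(0)  // hypothetical: path to the raw data
    val numPartitions = 4    // assumption: one partition per executor

    // Ask for at least numPartitions splits at load time. This is a minimum
    // hint, so the actual count can be higher.
    val vectors: RDD[Vector] = sc.textFile(inputPath, numPartitions)
      .filter(_.trim.nonEmpty)
      // Assumption: each line holds whitespace-separated numeric features.
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      // Repartition again in case the ETL steps changed the partition count,
      // then cache so that each KMeans iteration reuses the in-memory data.
      .repartition(numPartitions)
      .cache()

    // k = 5000 comes from the thread; maxIterations = 20 is illustrative.
    val model = KMeans.train(vectors, 5000, 20)
    println(s"Trained ${model.clusterCenters.length} cluster centers")

    sc.stop()
  }
}
```

Whether this removes the long pause depends on where the time is actually spent, so treat it as a starting point for experiments rather than a fix.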

On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshen84@gmail.com> wrote:

I have a large data set, and I expect to get 5000 clusters.

I load the raw data and convert it into DenseVector; then I repartition and cache it; finally, I pass the RDD[Vector] to KMeans.train().

Now the job is running, and the data is loaded. But according to the Spark UI, all of the data is loaded onto one executor. I checked that executor, and its CPU load is very low. I think it is using only 1 of its 8 cores. The other 3 executors are idle.
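
One way to cross-check what the Spark UI shows is to count the records in each partition of the cached RDD. The helper below is a minimal sketch; `PartitionInspector` and `partitionSizes` are hypothetical names, not part of Spark. It reports how evenly the records are spread across partitions, not which executor holds each partition, so it complements the UI rather than replacing it.

```scala
import org.apache.spark.rdd.RDD

object PartitionInspector {
  // Returns (partition index, record count) pairs for the given RDD.
  // This runs a Spark job and collects one small tuple per partition.
  def partitionSizes[T](rdd: RDD[T]): Array[(Int, Int)] =
    rdd.mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
}
```

Assuming `vectors` is the cached RDD[Vector], it could be used as `PartitionInspector.partitionSizes(vectors).foreach { case (i, n) => println(s"partition $i: $n records") }`. If almost all records fall into a single partition, the repartition step is probably not being applied to the RDD that is passed to KMeans.train().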

Did I miss something? Is it possible to distribute the workload to all 4 executors?