spark-user mailing list archives

From Amlan Jyoti <>
Subject KMeans Clustering result differs for 2 datasets with identical 'features'
Date Mon, 26 Mar 2018 07:14:36 GMT
Hi All,

I am trying to run a KMeans model on the 2 attached input datasets, 
old.csv and new.csv [2 columns: 'cid' as the label column and a 'features' 
column]. The records in the two files differ only in the customer ID (cid); 
the 'features' data mapped to each customer ID is the same.

Snippet of the 2 input datasets (old and new):

[snippet not preserved in the archive]


As the "features" of the 2 datasets are the same, the results of the KMeans 
model should have been identical, but we are getting different results. 

To illustrate, the number of points predicted in each cluster differs 
between the 2 datasets, as shown below:
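
A minimal sketch of how the "same features" premise could be checked outside Spark, assuming each CSV row has the form cid,features and that the features can be compared as plain strings. FeatureCheck and its methods are hypothetical names used only for illustration; they are not part of Spark, and the sample rows are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: compares the 'features' values of two
// datasets as multisets, ignoring the 'cid' column and the row order.
class FeatureCheck {

    // Counts how often each features string occurs. Each row is assumed to look
    // like "cid,features", with the features kept as one opaque string.
    static Map<String, Integer> featureCounts(List<String> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String row : rows) {
            // Split on the first comma only, so commas inside the features survive.
            String[] parts = row.split(",", 2);
            String features = parts.length > 1 ? parts[1].trim() : "";
            counts.merge(features, 1, Integer::sum);
        }
        return counts;
    }

    // True when both datasets hold the same features with the same multiplicity.
    static boolean sameFeatures(List<String> oldRows, List<String> newRows) {
        return featureCounts(oldRows).equals(featureCounts(newRows));
    }

    public static void main(String[] args) {
        // Made-up rows for illustration; real rows would be read from old.csv and new.csv.
        List<String> oldRows = Arrays.asList("c1,0.1 0.2", "c2,0.3 0.4");
        List<String> newRows = Arrays.asList("n9,0.3 0.4", "n8,0.1 0.2");
        System.out.println(sameFeatures(oldRows, newRows));
    }
}
```

The check ignores row order, so it confirms only that the two files contain the same feature values, not that they list them in the same order.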

                | prediction | count |
                |     1      |  3211 |
                |     2      | 30268 |
                |     0      | 16521 |

                | prediction | count |
                |     1      | 16312 |
                |     2      |  3119 |
                |     0      | 30569 |

I am using the Spark ML Java API with Spark's default configuration and the 
number of clusters set to 3. Below is the code snippet:
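// k = 3 as stated above; all other settings are left at Spark's defaults.
KMeans kmeans = new KMeans().setK(3);
KMeansModel trainedkMeansModel = kmeans.fit(inputDataset);

// The column list is cut off in the archived message; "prediction" is assumed
// here because the groupBy below needs it.
Dataset<Row> clusterOutput = trainedkMeansModel.transform(inputDataset).select("cid", "prediction");
clusterOutput.groupBy(new Column("prediction")).count().show();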


Is this behaviour expected? Is there anything I could do to achieve 
reproducible results? Please share your thoughts on this.

With Regards
Amlan Jyoti

