spark-user mailing list archives

From Amlan Jyoti <amlan.jy...@tcs.com>
Subject KMeans Clustering result differs for 2 datasets with identical 'features'
Date Mon, 26 Mar 2018 07:14:36 GMT
Hi All,

I am trying to run a KMeans model on the two attached input datasets, 
old.csv and new.csv. Each has 2 columns: 'cid' as the label column and a 
'features' column. The records differ only in the customer ID (cid); the 
'features' data mapped to each customer ID is the same in both files.
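
The statement that both files carry the same 'features' can be read two ways: the same set of feature rows, or the same rows in the same order. A minimal sketch in plain Java (no Spark) that checks both is shown here. The class name, method names and sample rows are hypothetical and used only for illustration; each row is assumed to be a {cid, features} pair with the features kept as a string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FeatureCheck {
    // Each row is {cid, features}; only the features part is extracted.
    static List<String> featuresOf(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(row[1]);
        }
        return out;
    }

    // True when both datasets list the same features in the same row order.
    static boolean sameOrder(List<String[]> a, List<String[]> b) {
        return featuresOf(a).equals(featuresOf(b));
    }

    // True when both datasets hold the same features, ignoring row order.
    static boolean sameMultiset(List<String[]> a, List<String[]> b) {
        List<String> fa = featuresOf(a);
        List<String> fb = featuresOf(b);
        Collections.sort(fa);
        Collections.sort(fb);
        return fa.equals(fb);
    }

    public static void main(String[] args) {
        // Made-up rows for illustration; not taken from old.csv or new.csv.
        List<String[]> oldRows = Arrays.asList(
            new String[] {"c1", "0.1 0.2"},
            new String[] {"c2", "0.3 0.4"});
        List<String[]> newRows = Arrays.asList(
            new String[] {"n7", "0.3 0.4"},
            new String[] {"n9", "0.1 0.2"});
        System.out.println("same multiset: " + sameMultiset(oldRows, newRows));
        System.out.println("same order:    " + sameOrder(oldRows, newRows));
    }
}
```

With the made-up rows above, the two lists match as a multiset but not in row order, which is the distinction the check is meant to surface.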

Snippet of the 2 input datasets, old.csv and new.csv side by side (the 
snippet itself is not present in this archived copy).

As the "features" of the 2 datasets are the same, the results of the KMeans 
model should be identical, but we are getting different results. 

To illustrate, the number of points predicted in each cluster differs 
between the 2 datasets, as shown below:

        old.csv:
                +----------+-----+
                |prediction|count|
                +----------+-----+
                |         1| 3211|
                |         2|30268|
                |         0|16521|
                +----------+-----+
        new.csv:
                +----------+-----+
                |prediction|count|
                +----------+-----+
                |         1|16312|
                |         2| 3119|
                |         0|30569|
                +----------+-----+
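
Because the cluster IDs assigned by KMeans are arbitrary labels, the two tables are easier to compare once the label is ignored. A minimal sketch in plain Java that sorts the cluster sizes and reports the gap per size rank follows; the counts are copied from the tables above, while the class and method names are hypothetical.

```java
import java.util.Arrays;

public class ClusterSizeCompare {
    // Sorts a copy of the sizes so the comparison ignores cluster labels.
    static long[] sortedSizes(long[] countsByLabel) {
        long[] copy = countsByLabel.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Absolute gap between the two size profiles, smallest cluster first.
    // Assumes both arrays have the same number of clusters.
    static long[] sizeGaps(long[] a, long[] b) {
        long[] sa = sortedSizes(a);
        long[] sb = sortedSizes(b);
        long[] gaps = new long[sa.length];
        for (int i = 0; i < sa.length; i++) {
            gaps[i] = Math.abs(sa[i] - sb[i]);
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Counts indexed by prediction 0, 1, 2, copied from the tables.
        long[] oldCounts = {16521, 3211, 30268};
        long[] newCounts = {30569, 16312, 3119};
        System.out.println(Arrays.toString(sortedSizes(oldCounts)));
        System.out.println(Arrays.toString(sortedSizes(newCounts)));
        System.out.println(Arrays.toString(sizeGaps(oldCounts, newCounts)));
    }
}
```

Sorted this way, the sizes are 3211 / 16521 / 30268 for old.csv and 3119 / 16312 / 30569 for new.csv, so the two runs differ by 92, 209 and 301 points per size rank rather than only by a relabelling of the clusters.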

I am using Spark ML from Java with Spark's default configuration and the 
number of clusters set to 3. Below is the code snippet:
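====================================
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// inputDataset is the Dataset loaded from old.csv or new.csv
KMeans kmeans = new KMeans()
                        .setK(3)
                        .setMaxIter(20)
                        .setTol(1.0E-4)
                        .setInitSteps(2)
                        .setSeed(-1689246527L)  // same fixed seed for both runs
                        .setFeaturesCol("features")
                        .setPredictionCol("prediction");

KMeansModel trainedkMeansModel = kmeans.fit(inputDataset);

// Assign each row to a cluster, then count the rows per cluster
Dataset<Row> clusterOutput = trainedkMeansModel.transform(inputDataset)
                        .select("cid", "prediction");
clusterOutput.groupBy("prediction").count().show();
====================================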

Is this behaviour expected? Is there anything I could do to achieve 
reproducible results? I would appreciate your thoughts on this.


With Regards
Amlan Jyoti
