Thanks Satya.

I tried setting initSteps to 25 and the maximum number of iterations to 500, in both R and Spark. The results provided below are from those settings.

Also, across repeated runs the centers remain almost the same within Spark and within R, but the Spark centers are different from the R centers.
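
Since the cluster indices that k-means assigns are arbitrary, cluster 0 in Spark need not be the counterpart of cluster 0 in R, so a row-by-row comparison of the two center tables can overstate the difference. As a rough sketch (plain Java, no Spark dependency; the class CenterMatcher, its methods, and the toy centers are hypothetical and are not taken from the attached data), the centers of one run could be paired with the nearest centers of the other run before comparing:

```java
import java.util.Arrays;

/** Hypothetical helper: pairs the centers of one run with the nearest centers of another run. */
final class CenterMatcher {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Greedy matching: repeatedly picks the closest pair that is still unmatched.
     * Returns match[i] = index in `other` that was paired with reference[i].
     * Assumes both arrays hold the same number of centers.
     */
    static int[] match(double[][] reference, double[][] other) {
        int k = reference.length;
        int[] match = new int[k];
        Arrays.fill(match, -1);
        boolean[] used = new boolean[k];
        for (int step = 0; step < k; step++) {
            int bestI = -1;
            int bestJ = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < k; i++) {
                if (match[i] != -1) continue;      // reference[i] already paired
                for (int j = 0; j < k; j++) {
                    if (used[j]) continue;         // other[j] already paired
                    double d = squaredDistance(reference[i], other[j]);
                    if (d < best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            match[bestI] = bestJ;
            used[bestJ] = true;
        }
        return match;
    }

    public static void main(String[] args) {
        // Toy 2-D centers, made up for illustration only.
        double[][] runA = { {0.0, 0.0}, {5.0, 5.0}, {10.0, 0.0} };
        double[][] runB = { {9.8, 0.1}, {0.2, -0.1}, {5.1, 4.9} };
        System.out.println(Arrays.toString(match(runA, runB)));
    }
}
```

Greedy pairing is only a heuristic. If centers still end up far apart after pairing, the two tools have probably converged to genuinely different solutions rather than to the same solution with different labels.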


Thanks & Regards
Saroj




From:        Satya Varaprasad Allumallu <allumallu@gmail.com>
To:        Saroj C <saroj.c@tcs.com>
Cc:        User <user@spark.apache.org>
Date:        01/02/2017 08:53 PM
Subject:        Re: Difference in R and Spark Output




Can you run the Spark KMeans algorithm multiple times and see if the centers remain stable? I am
guessing the difference is related to the random initialization of the centers.
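
The check suggested above can be sketched without Spark. What follows is a minimal illustration only: it uses plain Lloyd iterations started from randomly chosen data points, which is not Spark's k-means|| initialization and not R's default algorithm, and the class SeededKMeans, its methods, and the toy points are hypothetical. It reruns the clustering with several seeds and prints the within-cluster sum of squares, so runs that landed in different local optima become visible:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: Lloyd's k-means with a seeded random start. */
final class SeededKMeans {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to point p. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) best = c;
        }
        return best;
    }

    /** Runs Lloyd iterations starting from k distinct data points chosen with the given seed. */
    static double[][] fit(double[][] points, int k, int maxIter, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < points.length; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[idx.get(c)].clone();

        int dim = points[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) sums[c][d] += p[d];
            }
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;   // an empty cluster keeps its old center
                for (int d = 0; d < dim; d++) {
                    double updated = sums[c][d] / counts[c];
                    if (updated != centers[c][d]) changed = true;
                    centers[c][d] = updated;
                }
            }
            if (!changed) break;                // converged
        }
        return centers;
    }

    /** Within-cluster sum of squares: lower means a tighter clustering. */
    static double cost(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] p : points) total += squaredDistance(p, centers[nearest(p, centers)]);
        return total;
    }

    public static void main(String[] args) {
        // Toy 2-D points, made up for illustration only.
        double[][] points = {
            {0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}, {0, 10}, {1, 11}
        };
        for (long seed = 1; seed <= 5; seed++) {
            double[][] centers = fit(points, 3, 500, seed);
            System.out.println("seed=" + seed + " cost=" + cost(points, centers));
        }
    }
}
```

In Spark ML the corresponding knob is setSeed on the KMeans estimator. In R, calling set.seed() before kmeans() and raising the nstart argument play a similar role. Comparing the cost of the Spark and R solutions on the same data is one way to tell which of the two found the tighter clustering.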

On Mon, Jan 2, 2017 at 1:34 AM, Saroj C <saroj.c@tcs.com> wrote:
Dear Felix,
Thanks. Please find the differences below:

Cluster   Spark - Size   R - Size
0         69             114
1         79             141
2         77             93
3         90             44
4         130            53



Spark - Centers (5 centers, 22 values each; each center spans two lines)

0.807554406  0.123759  -0.58642  -0.17803  0.624278  -0.06752  0.033517  -0.01504  -0.02794  0.016699  0.20841
-0.00149  -0.05598  0.039746  0.030756  -0.19788  -0.07906  -0.14881  0.0056  0.01479  0.066883  0.002491

-0.428583581  -0.81975  0.347356  -0.18664  0.047582  0.058692  -0.0721  -0.13873  -0.08666  0.085334  0.054398
-0.0228  0.008369  0.073103  0.022246  -0.15439  -0.06016  -0.15073  -0.03734  0.004299  0.089258  -0.00694

0.692744675  0.148123  0.087253  0.851781  -0.2179  0.003407  -0.12357  -0.01795  0.016427  0.088004  0.021502
-0.04616  -0.00847  0.023397  0.057656  -0.12036  -0.03947  -0.13338  -0.02975  0.012217  0.090547  -0.00232

-0.677692276  0.581091  0.446125  -0.13087  0.037225  0.018936  0.055286  0.01146  -0.08648  0.053719  0.072753
-0.00873  -0.04448  0.042067  0.089221  -0.1977  -0.07368  -0.14674  -0.00641  0.020815  0.058425  0.016745

1.03518389  0.228964  0.539982  -0.3581  -0.13488  -0.00525  -0.1267  -0.04439  -0.01923  0.111272  -0.05181
-0.05508  -0.04143  0.046479  0.059224  -0.16148  -0.07541  -0.12046  -0.03535  0.003049  0.070862  0.010083

R - Centers (5 centers, 22 values each; each center spans two lines)

0.7710882  0.86271  0.249609  0.074961  0.251188  -0.05293  -0.11106  -0.08063  0.01516  0.054043  0.056937
-0.0287  -0.03291  0.056607  0.045214  -0.15237  -0.05442  -0.14038  -0.02326  0.013882  0.078523  -0.0087

-0.644077  0.022256  0.368266  -0.06912  0.123979  0.009181  -0.04506  -0.04179  -0.0255  0.041568  0.04081
-0.02936  -0.04849  0.049712  0.062894  -0.16736  -0.06679  -0.12705  -0.007  0.018079  0.062337  0.00349

0.9772678  -0.57499  0.523792  -0.27319  0.163677  0.053579  -0.07616  0.074556  0.00662  0.087303  0.088835
-0.01923  -0.04938  0.07299  0.059872  -0.19137  -0.04737  -0.1536  0.002926  0.049441  0.079147  0.02771

0.5172924  0.167666  -0.16523  -0.82951  -0.77577  -0.00981  0.018531  -0.09629  -0.1654  0.273644  -0.05433
-0.03593  0.115834  0.021465  -0.00981  -0.15112  -0.16178  -0.04783  -0.19962  -0.12418  0.07286  0.03266

0.717927  -0.34229  -0.33544  0.817617  -0.21383  0.02735  0.01675  -0.10814  -0.1747  0.033743  0.038308
-0.0495  -0.05961  -0.01977  0.092247  -0.16017  -0.04787  -0.20766  0.040038  0.024614  0.090587  -0.0236




Please let me know if any additional information would help in tracking down these anomalies.


Thanks & Regards
Saroj





From:        Felix Cheung <felixcheung_m@hotmail.com>
To:        User <user@spark.apache.org>, Saroj C <saroj.c@tcs.com>
Date:        12/31/2016 10:36 AM
Subject:        Re: Difference in R and Spark Output





Could you elaborate more on the huge difference you are seeing?




From: Saroj C <saroj.c@tcs.com>
Sent: Friday, December 30, 2016 5:12:04 AM
To: User
Subject: Difference in R and Spark Output

 

Dear All,
For the attached input file, there is a huge difference between the clusters produced by R and by Spark (ML). Any idea what could be causing the difference?


Note that we wanted to create five (5) clusters.


Please find the snippets in Spark and R


Spark


// Load the data file

// Create the K-means cluster
KMeans kmeans = new KMeans()
        .setK(5)
        .setMaxIter(500)
        .setFeaturesCol("features")
        .setPredictionCol("prediction");


In R


# Load the data file into df

# Create the K-means cluster
model <- kmeans(df, 5)




Thanks & Regards
Saroj
