spark-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: K-means faster on Mahout than on Spark
Date Tue, 25 Mar 2014 13:58:40 GMT
Mahout does have a k-means implementation, which can be run in MapReduce and iterative modes.

> On Mar 25, 2014, at 9:25 AM, Prashant Sharma <scrapcodes@gmail.com> wrote:
> 
> I think Mahout uses FuzzyKmeans, which is a different algorithm and is not iterative.

> 
> Prashant Sharma
> 
> 
>> On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov <pahomov.egor@gmail.com> wrote:
>> Hi, I'm running a benchmark that compares Mahout and SparkML. So far I have the following results for k-means:
>> Iterations  Elements    Mahout time (s)  Spark time (s)
>> 10          10000000    602              138
>> 40          10000000    1917             330
>> 70          10000000    3203             388
>> 10          100000000   1235             2226
>> 40          100000000   2755             6388
>> 70          100000000   4107             10967
>> 10          1000000000  7070             25268
>> 
>> Times are in seconds. The benchmark runs on a YARN cluster with about 40 machines. The elements to cluster are randomly generated. When I changed the persistence level from Memory to Memory_and_disk, Spark got faster on the large datasets.
>> 
>> What am I missing?
>> 
>> See my benchmarking code in attachment.
>> 
>> 
>> -- 
>> Sincerely yours
>> Egor Pakhomov
>> Scala Developer, Yandex
> 
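
The crossover in the quoted timings is easier to see as a ratio of Spark time to Mahout time per row. The following is only a sketch of that arithmetic, with the figures copied from the message above. The names RESULTS, spark_to_mahout_ratio and summarize are illustrative and are not taken from the attached benchmark code.

```python
# Benchmark rows copied from the quoted message:
# (iterations, elements, mahout_seconds, spark_seconds)
RESULTS = [
    (10, 10_000_000, 602, 138),
    (40, 10_000_000, 1917, 330),
    (70, 10_000_000, 3203, 388),
    (10, 100_000_000, 1235, 2226),
    (40, 100_000_000, 2755, 6388),
    (70, 100_000_000, 4107, 10967),
    (10, 1_000_000_000, 7070, 25268),
]


def spark_to_mahout_ratio(mahout_seconds, spark_seconds):
    """Spark time divided by Mahout time; a value above 1 means Spark was slower."""
    return spark_seconds / mahout_seconds


def summarize(results):
    """Return (iterations, elements, ratio) tuples with the ratio rounded to two decimals."""
    return [
        (iters, n, round(spark_to_mahout_ratio(mahout, spark), 2))
        for iters, n, mahout, spark in results
    ]


if __name__ == "__main__":
    for iters, n, ratio in summarize(RESULTS):
        verdict = "Spark slower" if ratio > 1 else "Spark faster"
        print(f"iterations={iters:>2} elements={n:>13,} spark/mahout={ratio:5.2f} ({verdict})")
```

On these numbers the ratio is below 1 for every run with 10000000 elements and above 1 for every run with 100000000 or more. That fits the observation in the message that switching to Memory_and_disk helped on the large inputs, though the ratios alone do not show the cause.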
