spark-user mailing list archives

From "Guillaume Pitel (eXenSa)" <guillaume.pi...@exensa.com>
Subject Re: K-means faster on Mahout than on Spark
Date Tue, 25 Mar 2014 13:36:57 GMT
Maybe with "MEMORY_ONLY", Spark has to recompute the RDDs several times because they don't fit in memory, which makes things run slower.

As a general safe rule, use MEMORY_AND_DISK_SER.
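
For example, a minimal sketch of how to set that storage level in Scala (the input RDD here is made up, just to show where the call goes):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("storage-level-example"))

    // Hypothetical input: random 10-dimensional points.
    val points = sc.parallelize(1 to 1000000).map(_ => Array.fill(10)(math.random))

    // MEMORY_AND_DISK_SER stores serialized partitions in memory and spills the
    // rest to local disk, so partitions that don't fit are read back from disk
    // instead of being recomputed on every iteration.
    points.persist(StorageLevel.MEMORY_AND_DISK_SER)
    points.count() // materialize once before the iterative job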



Guillaume Pitel - President of eXenSa

Prashant Sharma <scrapcodes@gmail.com> wrote:

>I think Mahout uses FuzzyKmeans, which is a different algorithm, and it is not iterative.
>
>
>Prashant Sharma
>
>
>
>On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov <pahomov.egor@gmail.com> wrote:
>
>Hi, I'm running a benchmark that compares Mahout and Spark MLlib. So far I have the following results for k-means:
>
>Iterations  Elements       Mahout time  Spark time
>10          10,000,000     602          138
>40          10,000,000     1,917        330
>70          10,000,000     3,203        388
>10          100,000,000    1,235        2,226
>40          100,000,000    2,755        6,388
>70          100,000,000    4,107        10,967
>10          1,000,000,000  7,070        25,268
>
>Times are in seconds. It runs on a YARN cluster with about 40 machines. The elements to cluster are generated randomly. When I changed the persistence level from MEMORY_ONLY to MEMORY_AND_DISK, Spark started working faster on the large datasets.
>
>What am I missing?
>
>See my benchmarking code in the attachment.
>
>
>
>-- 
>
>Sincerely yours
>Egor Pakhomov
>Scala Developer, Yandex
>
>
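
Purely as an illustration of the kind of setup described above (this is not the attached benchmark code), a minimal sketch of the Spark side of such a k-means run, assuming Spark 1.x-style MLlib where KMeans.train takes an RDD of Vectors; the dimensionality, k, and partition count are made-up values:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.storage.StorageLevel

    object KMeansBenchSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical arguments: number of points and number of iterations.
        val Array(numPoints, numIterations) = args.map(_.toInt)
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-bench-sketch"))

        // Randomly generated 10-dimensional points, persisted with
        // MEMORY_AND_DISK_SER as suggested above.
        val data = sc.parallelize(0 until numPoints, 400)
          .map(_ => Vectors.dense(Array.fill(10)(math.random)))
          .persist(StorageLevel.MEMORY_AND_DISK_SER)
        data.count() // materialize the input before timing the algorithm

        val start = System.currentTimeMillis()
        KMeans.train(data, 10, numIterations) // k = 10 clusters (made-up value)
        val elapsedSec = (System.currentTimeMillis() - start) / 1000
        println(s"k-means with $numIterations iterations took $elapsedSec s")
        sc.stop()
      }
    }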