spark-user mailing list archives

From: Lydia Ickler
Subject: PowerIterationClustering Benchmark
Date: Thu, 15 Dec 2016 22:04:27 GMT

Hi all,

I have a question regarding the PowerIterationClusteringExample.
I have adjusted the code so that it reads a file via sc.textFile("path/to/input"),
which works fine.

Now I wanted to benchmark the algorithm using different numbers of nodes to see how well the
implementation scales. As a testbed I have up to 32 nodes available, each with 16 cores and
running Spark 2.0.2 on YARN.
For my smallest input data set (16 MB), the runtime does not really change whether I use 1, 2, 4, 8, 16,
or 32 nodes (always ~1.5 minutes).
The same holds for my largest data set (2.3 GB): the runtime stays around 1 hour whether I use 16 or
32 nodes.
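
To put the measurements above into numbers, here is a minimal Python sketch that computes speedup and parallel efficiency from the approximate runtimes quoted above. The helper names `speedup` and `efficiency` are illustrative and not part of Spark, and the runtimes are the rounded values from this message, not exact timings. With ideal linear scaling the efficiency would stay near 1.0.

```python
# Approximate runtimes in minutes, keyed by node count (rounded values
# taken from the message above, not exact measurements).
small_runtimes = {1: 1.5, 2: 1.5, 4: 1.5, 8: 1.5, 16: 1.5, 32: 1.5}  # 16 MB input
large_runtimes = {16: 60.0, 32: 60.0}                                # 2.3 GB input


def speedup(runtimes, baseline_nodes):
    """Speedup of each run relative to the run on `baseline_nodes` nodes."""
    base = runtimes[baseline_nodes]
    return {n: base / t for n, t in sorted(runtimes.items())}


def efficiency(runtimes, baseline_nodes):
    """Parallel efficiency: speedup divided by the factor of added nodes."""
    s = speedup(runtimes, baseline_nodes)
    return {n: s[n] * baseline_nodes / n for n in s}


if __name__ == "__main__":
    for name, runs, base in (("16 MB", small_runtimes, 1),
                             ("2.3 GB", large_runtimes, 16)):
        print(name, "speedup:", speedup(runs, base))
        print(name, "efficiency:", efficiency(runs, base))
```

A constant runtime means the speedup stays at 1.0 while the efficiency halves with every doubling of the node count, which is the flat scaling described above.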

I was expecting the runtime to shrink when I, e.g., double the number of nodes.
To set up my cluster environment, I tried different suggestions from this paper

Has anyone experienced the same? Or does anyone have suggestions as to what might have gone wrong?

Thanks in advance!
