spark-user mailing list archives

From Chen Jin <karen...@gmail.com>
Subject Please Help: Amplab Benchmark Performance
Date Thu, 30 Jan 2014 04:10:36 GMT
Hi All,

https://amplab.cs.berkeley.edu/benchmark/ provides a nice benchmark
report. I am trying to reproduce the same set of queries in the
spark-shell so that we can better understand Shark and Spark and
their performance on EC2.

For the Aggregation Query with X=8, Shark-disk takes 210 seconds
and Shark-mem takes 111 seconds. However, when I materialize the
results to disk, the same query in spark-shell takes more than 5
minutes (reduceByKey is used in the shell for the aggregation).
Further, if I cache the uservisits RDD, performance deteriorates
considerably, since the dataset is far too big.
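
For concreteness, here is a minimal sketch of the kind of spark-shell job
described above. It is an illustration under stated assumptions, not the
exact code that was run: the input and output paths are hypothetical
placeholders, the comma delimiter and field positions (sourceIP in field 0,
adRevenue in field 3) are assumed from the benchmark's published uservisits
schema, and toKeyedRevenue is a helper name made up for this example. The
storage level shown is only one possible choice when the dataset does not
fit in memory.

```scala
import org.apache.spark.SparkContext._          // pair-RDD operations such as reduceByKey
import org.apache.spark.storage.StorageLevel

// Hypothetical locations; substitute the real input and output paths.
val inputPath  = "hdfs:///path/to/uservisits"
val outputPath = "hdfs:///path/to/aggregation-output"

// Assumed layout: comma-delimited text, sourceIP in field 0, adRevenue in field 3.
// Returns (first x characters of sourceIP, adRevenue).
def toKeyedRevenue(line: String, x: Int): (String, Double) = {
  val fields = line.split(",")
  (fields(0).take(x), fields(3).toDouble)
}

// `sc` is the SparkContext that spark-shell provides.
val uservisits = sc.textFile(inputPath)

// Optional, and only useful if the RDD is reused by later queries:
// store partitions serialized and spill to disk what does not fit in memory,
// instead of the default deserialized in-memory caching done by cache().
uservisits.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Roughly: SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) ... GROUP BY SUBSTR(sourceIP, 1, 8)
val revenueByPrefix = uservisits
  .map(line => toKeyedRevenue(line, 8))
  .reduceByKey(_ + _)

// Materialize the result to disk.
revenueByPrefix.saveAsTextFile(outputPath)
```

Whether persisting helps at all depends on whether uservisits is read more
than once; for a single pass over the data, skipping the persist call avoids
the extra caching work.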

Can anybody shed some light on why there is a more than 2x difference
between Shark-disk and spark-shell-disk, and on how to cache data in
Spark correctly so that we can achieve performance comparable to
Shark-mem?

Thank you very much,

-chen
