spark-user mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject Re: Benchmark numbers for terabytes of data
Date Wed, 04 Dec 2013 19:06:12 GMT
I'm reading the paper now, thanks. It states that 100-node clusters were used. Is it typical in the field to use 100-node clusters at the 1 TB scale? We were expecting to use ~10 nodes.

I'm still pretty new to cluster computing, so I'm just not sure how people typically set these up.

-Matt Cheah

From: Matei Zaharia <matei.zaharia@gmail.com>
Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
Date: Wednesday, December 4, 2013 10:53 AM
To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
Cc: Mingyu Kim <mkim@palantir.com>
Subject: Re: Benchmark numbers for terabytes of data

Yes, check out the Shark paper for example: https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/

The numbers on that benchmark are for Shark.

Matei

On Dec 3, 2013, at 3:50 PM, Matt Cheah <mcheah@palantir.com> wrote:

Hi everyone,

I notice the AMPLab benchmark page provides some numbers for gigabytes of data: https://amplab.cs.berkeley.edu/benchmark/
I was wondering if similar benchmark numbers existed for even larger data sets, in the terabytes
if possible.

Also, are there any for just raw Spark, i.e. no Shark?

Thanks,

-Matt Cheah

