spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Benchmark numbers for terabytes of data
Date Wed, 04 Dec 2013 19:17:49 GMT
These were EC2 clusters, so the machines were smaller than modern machines. You can definitely
have 1 TB datasets on 10 nodes too. Actually if you’re curious about hardware configuration,
take a look at http://spark.incubator.apache.org/docs/latest/hardware-provisioning.html.

Also, regarding Spark vs Shark — raw Spark code is usually faster than Shark, but we don’t
have as many recent benchmarks on large datasets. Some of the code running in the Shark paper
is Spark-based though (specifically the machine learning algorithms).

Matei

On Dec 4, 2013, at 11:06 AM, Matt Cheah <mcheah@palantir.com> wrote:

> I'm reading the paper now, thanks. It states 100-node clusters were used. Is this typical
> in the field to have 100 node clusters for the 1TB scale? We were expecting to be using ~10
> nodes.
> 
> I'm still pretty new to cluster computing, so just not sure how people have set these
> up.
> 
> -Matt Cheah
> 
> From: Matei Zaharia <matei.zaharia@gmail.com>
> Reply-To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Date: Wednesday, December 4, 2013 10:53 AM
> To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Cc: Mingyu Kim <mkim@palantir.com>
> Subject: Re: Benchmark numbers for terabytes of data
> 
> Yes, check out the Shark paper for example: https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
> 
> The numbers on that benchmark are for Shark.
> 
> Matei
> 
> On Dec 3, 2013, at 3:50 PM, Matt Cheah <mcheah@palantir.com> wrote:
> 
>> Hi everyone,
>> 
>> I notice the benchmark page for AMPLab provides some numbers on GBs of data: https://amplab.cs.berkeley.edu/benchmark/
>> I was wondering if similar benchmark numbers existed for even larger data sets, in the terabytes
>> if possible.
>> 
>> Also, are there any for just raw Spark, i.e., no Shark?
>> 
>> Thanks,
>> 
>> -Matt Cheah
> 

