These were EC2 clusters, so the machines were smaller than modern ones. You can definitely have 1 TB datasets on 10 nodes too. Actually, if you're curious about hardware configuration, take a look at

Also, regarding Spark vs. Shark: raw Spark code is usually faster than Shark, but we don't have as many recent benchmarks on large datasets. Some of the code run in the Shark paper is Spark-based, though (specifically the machine learning algorithms).


On Dec 4, 2013, at 11:06 AM, Matt Cheah <> wrote:

I'm reading the paper now, thanks. It states that 100-node clusters were used. Is it typical in the field to have 100-node clusters at the 1 TB scale? We were expecting to use ~10 nodes.

I'm still pretty new to cluster computing, so just not sure how people have set these up.

-Matt Cheah

From: Matei Zaharia <>
Reply-To: "" <>
Date: Wednesday, December 4, 2013 10:53 AM
To: "" <>
Cc: Mingyu Kim <>
Subject: Re: Benchmark numbers for terabytes of data

Yes, check out the Shark paper for example:

The numbers on that benchmark are for Shark.


On Dec 3, 2013, at 3:50 PM, Matt Cheah <> wrote:

Hi everyone,

I noticed the benchmark page for AMPLab provides some numbers on GBs of data: I was wondering if similar benchmark numbers exist for even larger datasets, in the terabytes if possible.

Also, are there any for just raw Spark, i.e., no Shark?


-Matt Cheah