spark-user mailing list archives

From Christopher Nguyen <...@adatao.com>
Subject Re: Benchmark numbers for terabytes of data
Date Wed, 04 Dec 2013 23:12:32 GMT
Matt, we've done 1TB linear models in 2-3 minutes on 40 node clusters
(30GB/node, just enough to hold all partitions simultaneously in memory).
You can make do with fewer nodes if you're willing to slow things down.
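
A rough sketch of the memory arithmetic behind those figures, in Python. The function names are hypothetical, and the sketch assumes decimal units (1TB = 1000GB) and ignores JVM, serialization, and caching overhead, so real clusters would likely need more headroom than this suggests.

```python
import math


def aggregate_memory_gb(nodes: int, mem_per_node_gb: float) -> float:
    """Total memory across the cluster, in GB."""
    return nodes * mem_per_node_gb


def min_nodes_to_hold(dataset_gb: float, mem_per_node_gb: float) -> int:
    """Smallest node count whose combined memory covers the dataset.

    Assumption: no overhead at all, so this is a lower bound.
    """
    return math.ceil(dataset_gb / mem_per_node_gb)


def in_memory_fraction(dataset_gb: float, nodes: int, mem_per_node_gb: float) -> float:
    """Fraction of the dataset that fits in cluster memory at once (capped at 1.0)."""
    return min(1.0, aggregate_memory_gb(nodes, mem_per_node_gb) / dataset_gb)


if __name__ == "__main__":
    dataset_gb = 1000  # 1TB, decimal units (assumption)
    print(aggregate_memory_gb(40, 30))             # memory on the 40-node setup
    print(min_nodes_to_hold(dataset_gb, 30))       # lower bound on node count
    print(in_memory_fraction(dataset_gb, 10, 30))  # what ~10 such nodes could hold
```

Under these assumptions, 40 nodes at 30GB give 1200GB in total, which is consistent with "just enough" for 1TB, while 10 such nodes would hold only part of the data at once.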

Some of our TB benchmark numbers are available in my Spark Summit slides.
Sorry I'm on a plane now but you should be able to find the slides fairly
easily.

Re your other comment: monolithic 100-node analytic clusters are not
unusual, but not yet common outside of large companies. I'd eduguesstimate
it to be in the top 5th percentile among companies with less than $500MM
in revenue, with selection bias among Silicon Valley companies.

Sent while mobile. Pls excuse typos etc.
On Dec 4, 2013 11:06 AM, "Matt Cheah" <mcheah@palantir.com> wrote:

>  I'm reading the paper now, thanks. It states 100-node clusters were
> used. Is it typical in the field to have 100-node clusters at the 1TB
> scale? We were expecting to be using ~10 nodes.
>
>  I'm still pretty new to cluster computing, so just not sure how people
> have set these up.
>
>  -Matt Cheah
>
>   From: Matei Zaharia <matei.zaharia@gmail.com>
> Reply-To: "user@spark.incubator.apache.org" <
> user@spark.incubator.apache.org>
> Date: Wednesday, December 4, 2013 10:53 AM
> To: "user@spark.incubator.apache.org" <user@spark.incubator.apache.org>
> Cc: Mingyu Kim <mkim@palantir.com>
> Subject: Re: Benchmark numbers for terabytes of data
>
>   Yes, check out the Shark paper for example:
> https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
>
>  The numbers on that benchmark are for Shark.
>
>  Matei
>
>  On Dec 3, 2013, at 3:50 PM, Matt Cheah <mcheah@palantir.com> wrote:
>
>  Hi everyone,
>
>  I notice the benchmark page for AMPLab provides some numbers on GBs of
> data: https://amplab.cs.berkeley.edu/benchmark/ I was wondering if
> similar benchmark numbers existed for even larger data sets, in the
> terabytes if possible.
>
>  Also, are there any for just raw Spark, i.e. no Shark?
>
>  Thanks,
>
>  -Matt Cheah
>
>
>
