spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Surprising Spark SQL benchmark
Date Thu, 06 Nov 2014 02:34:34 GMT
Yup, the Hadoop nodes were from 2013, each with 64 GB RAM, 12 cores, 10 Gbps Ethernet and 12
disks. For 100 TB of data, the intermediate data could fit in memory on this cluster, which
can make shuffle much faster than with intermediate data on SSDs. You can find the specs in
http://sortbenchmark.org/Yahoo2013Sort.pdf. It just takes effort to utilize modern machines
fully -- for instance the Yahoo! cluster had 1 TB/s network bandwidth, but only sorted data
at 0.02 TB/s. Systems optimized for sorting, like TritonSort (which also won this year's benchmark),
get much closer to full utilization.
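
The figures quoted in this thread can be checked with some quick arithmetic. The sketch below is only a back-of-the-envelope illustration: it takes the 2100-node count from Reynold's quoted message, uses decimal units (1 TB = 1000 GB), and ignores memory reserved for the OS and the framework. The variable names are made up for illustration.

```python
# Back-of-the-envelope check of the numbers quoted in this thread.
# Assumptions: decimal units, no allowance for OS or framework overhead.
NODES = 2100               # node count, from Reynold's quoted message
RAM_PER_NODE_GB = 64       # per-node RAM, as stated above
DATASET_TB = 100           # size of the sorted dataset
NETWORK_TB_PER_S = 1.0     # cluster network bandwidth, as stated above
SORT_RATE_TB_PER_S = 0.02  # achieved sort rate, as stated above

# Total RAM across the cluster, converted from GB to TB.
aggregate_ram_tb = NODES * RAM_PER_NODE_GB / 1000

# Does the dataset fit in aggregate memory?
fits_in_memory = aggregate_ram_tb > DATASET_TB

# Fraction of the network bandwidth that the sort rate represents.
utilization = SORT_RATE_TB_PER_S / NETWORK_TB_PER_S

print(f"Aggregate RAM: {aggregate_ram_tb:.1f} TB (fits {DATASET_TB} TB: {fits_in_memory})")
print(f"Sort rate as a share of network bandwidth: {utilization:.0%}")
```

Under these assumptions the cluster has more aggregate RAM than the dataset size, and the sort rate is a small fraction of the stated network bandwidth.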

Matei

> On Nov 5, 2014, at 4:10 PM, Reynold Xin <rxin@databricks.com> wrote:
> 
> Steve,
> 
> I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I
> updated the blog post to actually include CPU / disk / network measures.
> You should see that in any measure that matters to this benchmark, the old
> 2100 node cluster is vastly superior. The data even fit in memory!
> 
> 
> 
> On Wed, Nov 5, 2014 at 4:07 PM, Steve Nunez <snunez@hortonworks.com> wrote:
> 
>> Nicholas,
>> 
>> I never doubted the authenticity of the benchmark, nor the results. What I
>> think could be better is an objective analysis of the results. That post
>> neglected to point out the significant differences in the hardware those two
>> benchmarks were run on. It is a bit like bragging you broke the world record
>> at the Nürburgring in a 2014 1000hp LaFerrari and somehow forgetting to
>> mention that the last record was held by a 2001 Toyota Celica.
>> 
>> - Steve
>> 
>> 
>> From:  Nicholas Chammas <nicholas.chammas@gmail.com>
>> Date:  Wednesday, November 5, 2014 at 15:56
>> To:  Steve Nunez <snunez@hortonworks.com>
>> Cc:  Patrick Wendell <pwendell@gmail.com>, dev <dev@spark.apache.org>
>> Subject:  Re: Surprising Spark SQL benchmark
>> 
>>> Steve Nunez, I believe the information behind the links below should address
>>> your concerns earlier about Databricks's submission to the Daytona Gray
>>> benchmark.
>>> 
>>> On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas <nicholas.chammas@gmail.com>
>>> wrote:
>>>> On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas
>>>> <nicholas.chammas@gmail.com> wrote:
>>>> 
>>>>> I believe that benchmark has a pending certification on it. See
>>>>> http://sortbenchmark.org under "Process".
>>>> Regarding this comment, Reynold has just announced that this benchmark is now
>>>> certified.
>>>> * Announcement:
>>>> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
>>>> * Updated benchmark results page: http://sortbenchmark.org/
>>>> * Paper detailing Spark cluster configuration for the benchmark:
>>>> http://sortbenchmark.org/ApacheSpark2014.pdf
>>>> Nick
>>>> 
>>> 
>> 