spark-dev mailing list archives

From "" <>
Subject Re: Surprising Spark SQL benchmark
Date Sat, 01 Nov 2014 09:00:21 GMT
Hi Kay,

Thank you so much for your update!!
I look forward to the shared code from AMPLab.  As a member of the Spark community, I really
hope that I can help run TPC-DS on SparkSQL.  At the moment, I am trying the 22 TPC-H queries
on SparkSQL 1.1.0 + Hive 0.12, and on Hive 0.13.1, respectively (waiting for Spark 1.2).


On 1 Nov, 2014, at 3:51 am, Kay Ousterhout <> wrote:

> There's been an effort in the AMPLab at Berkeley to set up a shared
> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
> we do frequently in the lab to evaluate new research.  Based on this
> thread, it sounds like making this more widely-available is something that
> would be useful to folks for reproducing the results published by
> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
> list as soon as we're done.
> -Kay
> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <> wrote:
>> I believe that benchmark has a pending certification on it. See
>> under "Process".
>> It's true they did not share enough details on the blog for readers to
>> reproduce the benchmark, but they will have to share enough with the
>> committee behind the benchmark in order to be certified. Given that this is
>> a benchmark not many people will be able to reproduce due to size and
>> complexity, I don't see it as a big negative that the details are not laid
>> out as long as there is independent certification from a third party.
>> From what I've seen so far, the best big data benchmark anywhere is this:
>> It has all the details you'd expect, including hosted datasets, to allow
>> anyone to reproduce the full benchmark, covering a number of systems. I
>> look forward to the next update to that benchmark (a lot has changed since
>> Feb). And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>> So I disagree that the Spark community "hasn't been any better" in this
>> regard.
>> Nick
>> On Friday, October 31, 2014, Steve Nunez <> wrote:
>>> To be fair, we (Spark community) haven’t been any better, for example this
>>> benchmark:
>>> For which no details or code have been released to allow others to
>>> reproduce it. I would encourage anyone doing a Spark benchmark in future
>>> to avoid the stigma of vendor reported benchmarks and publish enough
>>> information and code to let others repeat the exercise easily.
>>>        - Steve
>>> On 10/31/14, 11:30, "Nicholas Chammas" <> wrote:
>>>> Thanks for the response, Patrick.
>>>> I guess the key takeaways are 1) the tuning/config details are everything
>>>> (they're not laid out here), 2) the benchmark should be reproducible (it's
>>>> not), and 3) reach out to the relevant devs before publishing (didn't
>>>> happen).
>>>> Probably key takeaways for any kind of benchmark, really...
>>>> Nick
>>>> On Friday, October 31, 2014, Patrick Wendell <> wrote:
>>>>> Hey Nick,
>>>>> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>>>>> developers when running this. It is really easy to make one system
>>>>> look better than others when you are running a benchmark yourself
>>>>> because tuning and sizing can lead to a 10X performance improvement.
>>>>> This benchmark doesn't share the mechanism in a reproducible way.
>>>>> There are a bunch of things that aren't clear here:
>>>>> 1. Spark SQL has optimized Parquet features; were these turned on?
>>>>> 2. It doesn't mention computing statistics in Spark SQL, but it does
>>>>> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>>>>> small tables which can make a 10X difference in TPC-H.
>>>>> 3. For data larger than memory, Spark SQL often performs better if you
>>>>> don't call "cache"; did they try this?
>>>>> Basically, a self-reported marketing benchmark like this that
>>>>> *shocker* concludes this vendor's solution is the best, is not
>>>>> particularly useful.
>>>>> If Citus Data wants to run a credible benchmark, I'd invite them to
>>>>> directly involve Spark SQL developers in the future. Until then, I
>>>>> wouldn't give much credence to this or any other similar vendor
>>>>> benchmark.
>>>>> - Patrick
>>>>> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas <> wrote:
>>>>>> I know we don't want to be jumping at every benchmark someone posts out
>>>>>> there, but this one surprised me:
>>>>>> This benchmark has Spark SQL failing to complete several queries in the
>>>>>> TPC-H benchmark. I don't understand much about the details of performing
>>>>>> benchmarks, but this was surprising.
>>>>>> Are these results expected?
>>>>>> Related HN discussion here:
>>>>>> Nick
