spark-user mailing list archives

From Soumya Simanta <soumya.sima...@gmail.com>
Subject Re: SparkSQL performance
Date Sat, 01 Nov 2014 01:20:17 GMT
I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.

As far as I understand, SparkSQL performs many of these optimizations
under the hood (order of Spark operations) and uses a more efficient storage
format. Is this assumption correct?

Has anyone done any comparison of SparkSQL with Impala? The fact that many
of the queries don't even finish in the benchmark is quite surprising and
hard to believe.

A few months ago there were a few emails about Spark not being able to
handle large volumes (TBs) of data. That myth was busted recently when the
folks at Databricks published their sorting record results.


Thanks
-Soumya






On Fri, Oct 31, 2014 at 7:35 PM, Du Li <lidu@yahoo-inc.com> wrote:

>   We have seen all kinds of results published that often contradict each
> other. My take is that the authors often know more tricks about how to tune
> their own/familiar products than the others. So the product in focus is
> tuned for ideal performance while the competitors are not. The authors are
> not necessarily biased but as a consequence the results are.
>
>  Ideally it’s critical for the user community to be informed of all the
> in-depth tuning tricks of all products. However, realistically, there is a
> big gap in terms of documentation. Hope the Spark folks will make a
> difference. :-)
>
>  Du
>
>
>   From: Soumya Simanta <soumya.simanta@gmail.com>
> Date: Friday, October 31, 2014 at 4:04 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: SparkSQL performance
>
>   I was really surprised to see the results here, esp. SparkSQL "not
> completing"
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
>  I was under the impression that SparkSQL performs really well because it
> can optimize the RDD operations and load only the columns that are
> required. This essentially means in most cases SparkSQL should be as fast
> as Spark is.
>
>  I would be very interested to hear what others in the group have to say
> about this.
>
>  Thanks
> -Soumya
>
>
>
