spark-user mailing list archives

From Matei Zaharia <>
Subject Re: Shark vs Impala
Date Mon, 23 Jun 2014 01:24:30 GMT
In this benchmark, the problem wasn’t that Shark could not run without enough memory; Shark
spills some of the data to disk and runs just fine. The issue was that the in-memory form
of the RDDs was larger than the cluster’s memory, even though the raw Parquet / ORC files did
fit in memory, so Cloudera did not want to report an “RDD” number where part of the RDD is
not in memory. But the wording “could not complete” is confusing: the queries complete
just fine.

We do plan to update the AMPLab benchmark with Spark SQL as well, and expand it to include
more of TPC-DS.


On Jun 22, 2014, at 9:53 AM, Debasish Das <> wrote:

> 600s for Spark vs 5s for Redshift... The numbers look much different from the AMPLab benchmark.
> Is it SSDs or something similar that's helping Redshift, or is the whole data set in memory
> when you run the query? Could you publish the query?
> Also, after Spark SQL, are we planning to add Spark SQL runtimes to the AMPLab benchmark
> as well?
> On Sun, Jun 22, 2014 at 9:13 AM, Toby Douglass <> wrote:
> I've just benchmarked Spark and Impala. Same data (in S3), same query, same cluster.
> Impala has a long load time, since it cannot load directly from S3. I have to create
> a Hive table on S3, then insert from that into an Impala table. This takes a long time: Spark
> took about 600s for the query, Impala 250s, but Impala required 6k seconds to load data from
> S3. If you're going to go the long-initial-load-then-quick-queries route, go for Redshift.
> On equivalent hardware, that took about 4k seconds to load, but then queries take around 5s.
