spark-user mailing list archives

From Toby Douglass <t...@avocet.io>
Subject Re: Shark vs Impala
Date Mon, 23 Jun 2014 12:32:31 GMT
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson <ilikerps@gmail.com> wrote:

> Note that regarding a "long load time", data format means a whole lot in
> terms of query performance. If you load all your data into compressed,
> columnar Parquet files on local hardware, Spark SQL would also perform far,
> far better than it would reading from gzipped S3 files.
>

Yes.  We're comparing our particular use cases; if we used Spark, we'd like
to run directly against gzipped files on S3 for the sheer convenience of it.
Having to pre-process data (which is the equivalent of the load phase with
NewSQL) is a pain in the neck.  One of the reasons for using post-Hadoop
(rather than NewSQL) systems is to avoid this.


> You must also be careful about your queries; certain queries can be
> answered much more efficiently due to specific optimizations implemented in
> the query engine. For instance, Parquet keeps statistics, so you could
> theoretically do a count(*) over petabytes of data in less than a second,
> blowing away any competition that resorts to actually reading data.
>

Yes.  I posted the query just now.  The Redshift table was only ordered by
timestamp, so in all cases the database should perform a single full table
scan.
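
To make the count(*) point quoted above concrete, here is a rough sketch of
the difference between answering a count from stored metadata and answering
it by reading the data.  This is a toy model only: the `row_groups` list and
its `num_rows` field are hypothetical stand-ins for the kind of per-row-group
statistics a columnar format keeps, not Parquet's actual on-disk layout, and
the sample log lines are made up.

```python
import gzip
import io


def count_by_scanning(gz_bytes):
    """Count records by decompressing and reading every line (a full scan)."""
    with gzip.open(io.BytesIO(gz_bytes), "rt") as fh:
        return sum(1 for _ in fh)


def count_from_metadata(row_groups):
    """Count records by summing per-row-group counts; no data is read."""
    return sum(rg["num_rows"] for rg in row_groups)


# Hypothetical data: 1000 log lines, gzipped as they might sit on S3.
lines = [f"2014-06-23T12:00:{i % 60:02d},event{i}\n" for i in range(1000)]
gz_bytes = gzip.compress("".join(lines).encode())

# Hypothetical footer describing the same data split into four row groups.
row_groups = [{"num_rows": 250} for _ in range(4)]

scanned = count_by_scanning(gz_bytes)          # cost grows with data size
from_footer = count_from_metadata(row_groups)  # cost grows with group count
```

The scan has to decompress every byte, whereas the metadata path only touches
one small record per row group, which is why a query like count(*) says
little about how the two setups compare on queries that must read the data.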
