spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Regarding tooling/performance vs RedShift
Date Wed, 06 Aug 2014 16:29:43 GMT
>
> 1) We get tooling out of the box from RedShift (specifically, stable JDBC
> access) - Spark we often are waiting for devops to get the right combo of
> tools working or for libraries to support sequence files.


The arguments about JDBC access and simpler setup definitely make sense. My
first non-trivial Spark application was actually an ETL process that sliced
and diced JSON + tabular data and then loaded it into Redshift. From there
on you got all the benefits of your average C-store database, plus the
added benefit of Amazon managing many annoying setup and admin details for
your Redshift cluster.

One area I'm looking forward to seeing Spark SQL excel at is offering fast
JDBC access to "raw" data--i.e. directly against S3 / HDFS; no ETL
required. For easy and flexible data exploration, I don't think you can
beat that with a C-store that you have to ETL stuff into.

> 2) There is a belief that for many of our queries (assumed to often be
> joins) a columnar database will perform orders of magnitude better.


This is definitely an "it depends" statement, but there is a detailed
benchmark here <https://amplab.cs.berkeley.edu/benchmark/> comparing Shark,
Redshift, and other systems. Have you seen it? Redshift does very well, but
Shark is on par with or better than it in most of the tests. Of course, going
forward we'll want to see Spark SQL match this kind of performance, and
that remains to be seen.
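
To make the "sorted keys" argument from the original message concrete, here
is a minimal Python sketch of why pre-sorted join keys can help. It compares
a merge join over inputs already sorted by key with a hash join that has to
build an in-memory index first. The function names and sample data are
hypothetical, and this illustrates the general technique only; it is not a
description of how Redshift or Spark implement joins internally.

```python
from collections import defaultdict


def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are ALREADY sorted by key.

    Both inputs are scanned once, front to back, so no index has to be built.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


def hash_join(left, right):
    """Inner-join two lists of (key, value) pairs in any order.

    Builds a hash index over the right side, then probes it per left row.
    """
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


# Hypothetical sample data, sorted by key.
users = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
orders = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]

merged = merge_join(users, orders)
hashed = hash_join(users, orders)
```

Both functions return the same rows here. The difference is that the merge
join needs no extra index, but only works when both sides arrive sorted on
the join key, which is the property the quoted argument relies on.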

Nick



On Wed, Aug 6, 2014 at 12:06 PM, Gary Malouf <malouf.gary@gmail.com> wrote:

> My company is leaning towards moving much of their analytics work from our
> own Spark/Mesos/HDFS/Cassandra setup to RedShift.  To date, I have been
> the internal advocate for using Spark for analytics, but a number of good
> points have been brought up to me.  The reasons being pushed are:
>
> - RedShift exposes a jdbc interface out of the box (no devops work there)
> and data looks and feels like it is in a normal sql database.  They want
> this out of the box from Spark, no trying to figure out which version
> matches this version of Hive/Shark/SparkSQL etc.  Yes, the next release
> theoretically supports this but there have been release issues our team has
> battled to date that erode the trust.
>
> - Complaints around challenges we have faced running a spark shell locally
> against a cluster in EC2.  It is partly a devops issue of deploying the
> correct configurations to local machines, being able to kick a user off
> hogging RAM, etc.
>
> - "I want to be able to run queries from my python shell against your
> sequence file data, roll it up and in the same shell leverage python graph
> tools."  - I'm not very familiar with the Python setup, but I believe by
> being able to run locally AND somehow add custom libraries to be accessed
> from PySpark this could be done.
>
> - "Joins will perform much better (in RedShift) because it says it sorts
> its keys.  We cannot pre-compute all joins away."
>
>
> Basically, their argument is two-fold:
>
> 1) We get tooling out of the box from RedShift (specifically, stable JDBC
> access) - Spark we often are waiting for devops to get the right combo of
> tools working or for libraries to support sequence files.
>
> 2) There is a belief that for many of our queries (assumed to often be
> joins) a columnar database will perform orders of magnitude better.
>
>
>
> Anyway, a test is being set up to compare the two on the performance side
> but from a tools perspective it's hard to counter the issues that are
> brought up.
>
