spark-user mailing list archives

From Gary Malouf <malouf.g...@gmail.com>
Subject Re: Regarding tooling/performance vs RedShift
Date Wed, 06 Aug 2014 19:53:12 GMT
Forgot to cc the mailing list :)


On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) <
R.Daniel@elsevier.com> wrote:

> Agreed. Being able to use SQL to make a table, pass it to a graph
> algorithm, pass that output to a machine learning algorithm, being able to
> invoke user-defined Python functions, … are capabilities that far exceed
> what we can do with Redshift. The total performance will be much better,
> and the programmer productivity will be much better, even if the SQL
> portion is not quite as fast. Mostly I was just objecting to "Redshift
> does very well, but Shark is on par or better than it in most of the tests"
> when that was not how I read the results, and Redshift was on HDDs.
>
>
>
> BTW – What are you doing w/ Spark? We have a lot of text and other content
> that we want to mine, and are shifting onto Spark so we have the greater
> capabilities mentioned above.
>
> Best regards,
>
> Ron Daniel, Jr.
> Director, Elsevier Labs
> r.daniel@elsevier.com
> mobile: +1 619 208 3064
>
> *From:* Gary Malouf [mailto:malouf.gary@gmail.com]
> *Sent:* Wednesday, August 06, 2014 12:35 PM
> *To:* Daniel, Ronald (ELS-SDG)
>
> *Subject:* Re: Regarding tooling/performance vs RedShift
>
>
>
> Hi Ronald,
>
>
>
> In my opinion, the performance just has to be 'close' to make that piece
> irrelevant.  I think the real issue comes down to tooling and the ease of
> connecting their various Python tools from the office to results coming out
> of Spark or another solution in 'the cloud'.
>
>
>
>
>
> On Wed, Aug 6, 2014 at 1:43 PM, Daniel, Ronald (ELS-SDG) <
> R.Daniel@elsevier.com> wrote:
>
> Just to point out that the benchmark you point to has Redshift running on
> HDD machines instead of SSD, and it is still faster than Shark in all but
> one case.
>
>
>
> Like Gary, I'm also interested in replacing something we have on Redshift
> with Spark SQL, as it will give me much greater capability to process
> things. I'm willing to sacrifice some performance for the greater
> capability. But it would be nice to see the benchmark updated with Spark
> SQL, and with a more competitive configuration of Redshift.
>
>
>
> Best regards, and keep up the great work!
>
>
>
> Ron
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.chammas@gmail.com]
> *Sent:* Wednesday, August 06, 2014 9:30 AM
> *To:* Gary Malouf
> *Cc:* user
>
>
> *Subject:* Re: Regarding tooling/performance vs RedShift
>
>
>
> 1) We get tooling out of the box from RedShift (specifically, stable JDBC
> access) - with Spark we are often waiting for devops to get the right combo
> of tools working or for libraries to support sequence files.
>
>
>
> The arguments about JDBC access and simpler setup definitely make sense.
> My first non-trivial Spark application was actually an ETL process that
> sliced and diced JSON + tabular data and then loaded it into Redshift. From
> there on you got all the benefits of your average C-store database, plus
> the added benefit of Amazon managing many annoying setup and admin details
> for your Redshift cluster.
>
>
>
> One area I'm looking forward to seeing Spark SQL excel at is offering fast
> JDBC access to "raw" data--i.e. directly against S3 / HDFS; no ETL
> required. For easy and flexible data exploration, I don't think you can
> beat that with a C-store that you have to ETL stuff into.
>
>
>
> 2) There is a belief that for many of our queries (assumed to often be
> joins) a columnar database will perform orders of magnitude better.
>
>
>
> This is definitely an "it depends" statement, but there is a detailed
> benchmark here <https://amplab.cs.berkeley.edu/benchmark/> comparing
> Shark, Redshift, and other systems. Have you seen it? Redshift does very
> well, but Shark is on par or better than it in most of the tests. Of
> course, going forward we'll want to see Spark SQL match this kind of
> performance, and that remains to be seen.
>
>
>
> Nick
>
>
>
>
>
> On Wed, Aug 6, 2014 at 12:06 PM, Gary Malouf <malouf.gary@gmail.com>
> wrote:
>
> My company is leaning towards moving much of its analytics work from our
> own Spark/Mesos/HDFS/Cassandra setup to RedShift.  To date, I have been
> the internal advocate for using Spark for analytics, but a number of good
> points have been brought up to me.  The reasons being pushed are:
>
>
>
> - RedShift exposes a JDBC interface out of the box (no devops work there)
> and data looks and feels like it is in a normal SQL database.  They want
> this out of the box from Spark, without trying to figure out which version
> matches which version of Hive/Shark/Spark SQL, etc.  Yes, the next release
> theoretically supports this, but there have been release issues our team has
> battled to date that erode the trust.
>
>
>
> - Complaints around challenges we have faced running a Spark shell locally
> against a cluster in EC2.  It is partly a devops issue of deploying the
> correct configurations to local machines, being able to kick off a user
> who is hogging RAM, etc.
>
>
>
> - "I want to be able to run queries from my Python shell against your
> sequence file data, roll it up and in the same shell leverage Python graph
> tools."  - I'm not very familiar with the Python setup, but I believe this
> could be done by running locally AND somehow adding custom libraries to be
> accessed from PySpark.
>
>
>
> - "Joins will perform much better (in RedShift) because it says it sorts
> its keys.  We cannot pre-compute all joins away."
>
>
>
>
>
> Basically, their argument is two-fold:
>
>
>
> 1) We get tooling out of the box from RedShift (specifically, stable JDBC
> access) - with Spark we are often waiting for devops to get the right combo
> of tools working or for libraries to support sequence files.
>
>
>
> 2) There is a belief that for many of our queries (assumed to often be
> joins) a columnar database will perform orders of magnitude better.
>
>
>
>
>
>
>
> Anyway, a test is being set up to compare the two on the performance side,
> but from a tools perspective it's hard to counter the issues that are
> brought up.
>
>
>
>
>
