spark-user mailing list archives

From Vladimir Rodionov <>
Subject Re: Spark v Redshift
Date Wed, 05 Nov 2014 02:45:02 GMT
>> We service templated queries from the appserver, i.e. user fills
>>out some forms, dropdowns: we translate to a query.


>>The target data
>>size is about a billion records, 20'ish fields, distributed throughout a
>>year (about 50GB on disk as CSV, uncompressed).

tells me that a proprietary in-memory app will be the best option for you.

I do not see any need for either Spark or Redshift in your case.
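
To illustrate, the example query from the quoted message (median of Field1
from Table1, joined to Table2 on a surrogate key, filtered on Field2, grouped
by Field3) can be answered from plain in-memory structures. What follows is
only a minimal sketch: the column names, the sample rows, and the function
median_by_group are hypothetical, and a real billion-row app would need a
compact columnar layout rather than Python dicts.

```python
from collections import defaultdict
from statistics import median

# Hypothetical stand-in for Table1: one row per time-series observation.
# "key" is the surrogate key into Table2, "field3" is the grouping column.
table1 = [
    {"key": 1, "field1": 10.0, "field3": "a"},
    {"key": 1, "field1": 20.0, "field3": "a"},
    {"key": 3, "field1": 30.0, "field3": "a"},
    {"key": 2, "field1": 100.0, "field3": "a"},
    {"key": 3, "field1": 5.0, "field3": "b"},
    {"key": 2, "field1": 7.0, "field3": "b"},
    {"key": 1, "field1": 9.0, "field3": "b"},
]

# Hypothetical stand-in for Table2: reference data indexed by surrogate key.
table2 = {
    1: {"field2": "X"},
    2: {"field2": "Y"},
    3: {"field2": "X"},
}


def median_by_group(facts, reference, field2_filter):
    """Median of field1 per field3 for rows whose joined field2 matches."""
    groups = defaultdict(list)
    for row in facts:
        ref = reference.get(row["key"])  # hash lookup acts as the join
        if ref is not None and ref["field2"] == field2_filter:
            groups[row["field3"]].append(row["field1"])
    return {group: median(values) for group, values in groups.items()}


result = median_by_group(table1, table2, "X")
```

The dict lookup plays the role of the join, so each templated query is a
single pass over the fact rows held in memory.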

On Tue, Nov 4, 2014 at 5:41 PM, agfung <> wrote:

> Sounds like context would help, I just didn't want to subject people to a
> wall of text if it wasn't necessary :)
> Currently we use neither Spark SQL (or anything in the Hadoop stack) nor
> Redshift.  We service templated queries from the appserver, i.e. user fills
> out some forms, dropdowns: we translate to a query.
> Data is "basically" one table containing thousands of independent time
> series, with one or two tables of reference data to join to.  e.g. median
> value of Field1 from Table1 where Field2 from Table 2 matches X filter, T1
> and T2 joining on a surrogate key, group by a different Field3.  The data
> structure is a little bit dynamic.  User can upload any CSV, as long as they
> tell us the name of each column and the programmatic type.  The target data
> size is about a billion records, 20'ish fields, distributed throughout a
> year (about 50GB on disk as CSV, uncompressed).
> So we're currently doing "historical" analytics (e.g. see analytic results
> of only yesterday's data or older, but want to see the result "quickly").
> We eventually intend to do "realtime" (or "streaming") analytics (i.e. see
> the impact of new data on analytics "quickly").  Machine learning is also on
> the roadmap.
> One proposition is for Spark SQL as a complete replacement for Redshift.
> It would simplify the architecture, since our long term strategy is to handle
> data intake and ETL on HDFS (regardless of Redshift or Spark SQL).  The
> other parts of the Hadoop family that would come into play for ETL are
> undetermined right now.  Spark SQL appears to have relational ability, and
> if we're going to use the Hadoop stack for ML and streaming analytics, and
> it has the ability, why not do it all on one stack and not shovel data
> around?  Also, lots of people talking about it.
> The other proposition is Redshift as the historical analytics solution, and
> something else (could be Spark, doesn't matter) for streaming analytics and
> ML.   If we need to relate the two, we'll have an API or process to stitch
> it together.   I've read about the "lambda architecture", which more or less
> describes this approach.  The motivation is Redshift has the AWS
> reliability/scalability/operational concerns worked out, richer query
> language (SQL and pgsql functions are designed for slice-n-dice analytics)
> so we can spend our coding time elsewhere, and a measure of safety against
> design issues and bugs: Spark just came out of incubator status this year,
> and it's much easier to find people on the web raving positively about
> Redshift in real-world usage (i.e. part of live, client-facing system) than
> Spark.
> category_theory's observation that most of the speed comes from fitting in
> memory is helpful.  It's what I would have surmised from the AMPLab Big Data
> benchmark, but confirmation from the hands-on community is invaluable,
> thank you.
> I understand a lot of it simply has to do with what-do-you-value-more
> weightings, and we'll do prototypes/benchmarks if we have to, just wasn't
> sure if there were any other "key assumptions/requirements/gotchas" to
> consider.
