spark-dev mailing list archives

From "Evan R. Sparks" <>
Subject Re: Spark SQL value proposition in batch pipelines
Date Thu, 12 Feb 2015 17:29:55 GMT
Well, you can always join as many RDDs as you want by chaining them
together, e.g. a.join(b).join(c). I probably wouldn't join thousands of
RDDs this way, but ten is certainly doable.
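For example (a sketch; a, b, and c are hypothetical pair RDDs sharing a key
type, with element types made up purely for illustration):

```scala
// Hypothetical pair RDDs keyed by Int; only the shape matters here.
// a: RDD[(Int, String)], b: RDD[(Int, Double)], c: RDD[(Int, Long)]
val joined = a.join(b)   // RDD[(Int, (String, Double))]
              .join(c)   // RDD[(Int, ((String, Double), Long))]
```

Each join in the chain is its own shuffle, and the nested tuples get
unwieldy quickly, which is part of why I wouldn't push this to thousands
of RDDs.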

That said, Spark SQL has an optimizer under the covers that can make clever
decisions: pushing the predicates in the WHERE clause down to the base
data (even into external data sources, if you have them), reordering joins,
and choosing between join implementations (e.g. a broadcast join instead
of the default shuffle-based hash join you get from RDD.join). These
decisions can make your queries run orders of magnitude faster than a
version hand-built from basic RDD transformations. Better still, the
optimizer is young and should keep improving, so many of your queries
will get faster with each new release.
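As a concrete sketch (assuming an existing SparkContext sc, and two
registered tables whose names I'm making up for illustration):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// "users" and "orders" are illustrative table names registered elsewhere.
val result = sqlContext.sql("""
  SELECT u.name, SUM(o.amount) AS total
  FROM users u
  JOIN orders o ON u.id = o.user_id
  WHERE o.amount > 100
  GROUP BY u.name
""")
```

Here the optimizer is free to push the amount > 100 filter below the join
(and into the data source, if it supports pushdown), and to pick a
broadcast join if one side is small enough - none of which you get for
free when you hand-wire the same query out of RDD operations.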

I'm sure the Spark SQL devs can enumerate many other benefits, but as soon
as you're working with multiple tables and doing fairly textbook SQL, you
likely want the engine figuring this stuff out for you rather than
hand-coding it yourself. And with Spark, you can always drop back to
plain old RDDs and use map/reduce/filter/cogroup, etc. when you need to.
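For instance (illustrative names again; assume orders: RDD[(Int, Double)]
keyed by user id and users: RDD[(Int, String)]):

```scala
// Filter with a plain RDD op, then cogroup the two datasets by key.
val bigSpenders = orders
  .filter { case (_, amount) => amount > 100 }
  .cogroup(users)  // RDD[(Int, (Iterable[Double], Iterable[String]))]
```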

On Thu, Feb 12, 2015 at 8:56 AM, vha14 <> wrote:

> My team is building a batch data processing pipeline using Spark API and
> trying to understand if Spark SQL can help us. Below are what we found so
> far:
> - SQL's declarative style may be more readable in some cases (e.g. joining
> of more than two RDDs), although some devs prefer the fluent style
> regardless.
> - Cogrouping of more than 4 RDDs is not supported, and it's not clear
> whether Spark SQL supports joining an arbitrary number of RDDs.
> - It seems that Spark SQL features such as predicate-pushdown
> optimization and dynamic schema inference are less applicable in a
> batch environment.
> Your inputs/suggestions are most welcome!
> Thanks,
> Vu Ha
> CTO, Semantic Scholar
