calcite-dev mailing list archives

From Linan Zheng <lazh...@bu.edu>
Subject Re: Apache Calcite Spark Adaptor GSOC 2018 Linan Zheng
Date Wed, 21 Mar 2018 02:54:17 GMT
Thank you all for your responses; the points you raised are valid and the
advice is much appreciated. As suggested by the JIRA ticket CALCITE-1737, the
goal of my GSoC proposal is to support Spark's DataFrame/DataSet API in
Apache Calcite. However, this task makes little sense to me at the moment:
the purpose of the DataFrame/DataSet API is to provide a high-level interface
and to perform query optimization that yields an efficient physical plan,
which is exactly what Calcite with the Spark adapter already does. Supporting
the DataFrame/DataSet API in Calcite would therefore amount to chaining two
optimizers together, when they are really two optimizers you are supposed to
choose between. Also, as Alessandro mentioned, gluing two planners together
can give undesired results in practice. For these reasons I am having a hard
time putting together a valid proposal for this GSoC project, and I would
greatly appreciate it if anyone could offer some clarification regarding this
JIRA ticket.
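
For concreteness, my current understanding (please correct me if I am wrong)
is that the Spark adapter is switched on through Calcite's `spark` connection
property, so Catalyst never sees the query at all. A minimal sketch of such a
JDBC URL, where the model path is of course hypothetical:

```
jdbc:calcite:spark=true;model=/path/to/model.json
```

With `spark=true`, Calcite's own planner chooses Spark relational operators
and generates the Spark program itself, which is why chaining it with
Catalyst seems redundant to me.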

Again, thank you for your replies and your help.

Best,
Linan Zheng

On Sat, Mar 17, 2018 at 3:40 AM, Alessandro Solimando <
alessandro.solimando@gmail.com> wrote:

> In my experience, if the "native" optimizer cannot be turned off, it can
> "revert" some of the optimizations when you submit your already optimized
> program/SQL query to the engine.
>
> As far as Spark 2.x is concerned, I am not aware of any way to turn Catalyst
> off, so if you have a different cost model and/or query planner you might
> easily end up with a different logical and/or physical plan than the one you
> expect.
>
> In the "Calcite performance benchmark" discussion started by Edmon Begoli,
> this fact is addressed: he proposed evaluating Calcite with and without the
> "native" optimizer, which makes a lot of sense to me and could lead to
> surprising results.
>
> My knowledge of Catalyst internals is unfortunately pretty shallow, so I
> cannot tell to what extent this is an issue, or whether potential problems
> can be bypassed by using hints or similar techniques.
>
> If anyone knows more or has practical examples on the subject, I would be
> very interested in hearing about them.
>
> Best regards,
> Alessandro
>
> On 16 March 2018 at 22:35, Julian Hyde <jhyde.apache@gmail.com> wrote:
>
> > The purpose of Calcite’s Spark adapter is to circumvent Spark SQL and
> > Catalyst entirely. Calcite parses the SQL, optimizes it into a physical
> > plan that uses Spark relational operators, and then converts that plan
> > to a Spark program.
> >
> > If you want to use Spark SQL and Catalyst that’s totally fine, but don’t
> > use Calcite for those cases.
> >
> > Julian
> >
> >
> > > On Mar 16, 2018, at 11:44 AM, Linan Zheng <lazheng@bu.edu> wrote:
> > >
> > > Hi Everyone,
> > >
> > > My name is Linan Zheng, and I am currently a senior CS student at Boston
> > > University. I am fascinated by the idea of adding support for Apache
> > > Spark's DataFrame/DataSet API to Apache Calcite. Right now I am working
> > > on the proposal, and I hope I can get some advice on it. My question is:
> > > since Spark has implemented the Catalyst query optimizer in Spark SQL,
> > > how should I approach Catalyst's planning rules (logical and physical)?
> > > And which component should be in charge of query optimization? Any
> > > advice and corrections would be much appreciated, and thank you for
> > > reading this email.
> > >
> > > --
> > > Best Regards,
> > > Linan Zheng
> >
> >
>



-- 
Best Regards,
Linan Zheng
