spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: Spark SQL parser and DDL
Date Sun, 07 Oct 2018 21:24:29 GMT
Sounds like a good idea?

Would this be a step in the direction of supporting variation of the SQL dialect, too?


________________________________
From: Ryan Blue <rblue@netflix.com.invalid>
Sent: Thursday, October 4, 2018 8:56 AM
To: Spark Dev List
Subject: Spark SQL parser and DDL


Hi everyone,

I’ve been working on SQL DDL statements for v2 tables lately, including the proposed additions
to drop, rename, and alter columns. The most recent update I’ve added is to allow transformation
functions in the PARTITION BY clause to pass to v2 data sources. This allows sources like
Iceberg to do partition pruning internally.

One of the difficulties has been that the SQL parser is coupled to the current logical plans
and includes details that are specific to them. For example, data source table creation makes
determinations like the EXTERNAL keyword is not allowed and instead the mode (external or
managed) is set depending on whether a path is set. It also translates IF NOT EXISTS into
a SaveMode and introduces a few other transformations.

The main problem with this is that converting the SQL plans produced by the parser to v2 plans
requires interpreting these alterations and not the original SQL. Another consequence is that
there are two parsers: AstBuilder in spark-catalyst and SparkSqlParser in spark-sql (core)
because not all of the plans are available to the parser in the catalyst module.

I think it would be cleaner if we added a sql package with catalyst plans that carry the SQL
options as they were parsed, and then convert those plans to specific implementations depending
on the tables that are used. That makes support for v2 plans much cleaner by converting from
a generic SQL plan instead of creating a v1 plan that assumes a data source table and then
converting that to a v2 plan (playing telephone with logical plans).

This has simplified the work I’ve been doing to add PARTITION BY transformations. Instead
of needing to add transformations to the CatalogTable metadata that’s used everywhere, this
only required a change to the rule that converts from the parsed SQL plan to CatalogTable-based
v1 plans. It is also cleaner to have the logic for converting to CatalogTable in DataSourceAnalysis
instead of in the parser itself.

Are there objections to this approach for integrating v2 plans?

--
Ryan Blue
Software Engineer
Netflix

Mime
View raw message