spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Spark SQL parser and DDL
Date Thu, 04 Oct 2018 15:56:17 GMT
Hi everyone,

I’ve been working on SQL DDL statements for v2 tables lately, including the
proposed additions to drop, rename, and alter columns. My most recent
update allows transformation functions in the PARTITION BY clause, which
are passed through to v2 data sources. This allows sources like Iceberg to
do partition pruning internally.
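
As a rough illustration of what passing transformation functions to a
source could enable, here is a minimal sketch. None of these names come
from Spark or Iceberg: Transform, Identity, Days, Bucket, partitionValue,
and prune are all hypothetical, and Bucket uses a plain modulo where a real
source would use a hash.

```scala
object PartitionTransformSketch {
  // Hypothetical model of expressions in a PARTITION BY clause.
  sealed trait Transform
  final case class Identity(column: String) extends Transform
  final case class Days(column: String) extends Transform // epoch seconds -> day number
  final case class Bucket(numBuckets: Int, column: String) extends Transform

  // The partition value a source would derive from a row's column value.
  def partitionValue(t: Transform, value: Long): Long = t match {
    case Identity(_)  => value
    case Days(_)      => java.lang.Math.floorDiv(value, 86400L)
    case Bucket(n, _) => java.lang.Math.floorMod(value, n.toLong) // simplification: no hashing
  }

  // Keep only the partitions that can contain rows where column == value.
  def prune(t: Transform, partitions: Seq[Long], value: Long): Seq[Long] = {
    val wanted = partitionValue(t, value)
    partitions.filter(_ == wanted)
  }
}
```

Because the source knows the transform, it can map a filter on the original
column to a filter on partition values without Spark being involved.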

One of the difficulties has been that the SQL parser is coupled to the
current logical plans and includes details that are specific to them. For
example, when creating a data source table, the parser rejects the EXTERNAL
keyword and instead sets the table type (external or managed) depending on
whether a path is set. It also translates IF NOT EXISTS into a SaveMode and
introduces a few other transformations.

The main problem with this is that converting the SQL plans produced by the
parser to v2 plans requires interpreting these alterations rather than the
original SQL. Another consequence is that there are two parsers, AstBuilder
in spark-catalyst and SparkSqlParser in spark-sql (core), because not all of
the plans are available to the parser in the catalyst module.

I think it would be cleaner if we added a sql package with catalyst plans
that carry the SQL options as they were parsed, and then convert those
plans to specific implementations depending on the tables that are used.
That makes support for v2 plans much cleaner by converting from a generic
SQL plan instead of creating a v1 plan that assumes a data source table and
then converting that to a v2 plan (playing telephone with logical plans).
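
A minimal sketch of the idea, under stated assumptions: every name below
(CreateTableStatement, V1CreateTable, V2CreateTable, convert, and the
isV2 check) is hypothetical and much simpler than Spark's real plans. The
v1 branch mirrors the two translations described earlier (path decides
external vs. managed, IF NOT EXISTS becomes a save mode), while the v2
branch receives the options as they were parsed.

```scala
object ParsedPlanSketch {
  // Hypothetical plan that carries CREATE TABLE options exactly as parsed.
  final case class CreateTableStatement(
      name: String,
      provider: String,
      location: Option[String],
      ifNotExists: Boolean,
      partitioning: Seq[String])

  sealed trait TableType
  case object Managed extends TableType
  case object External extends TableType

  // Stand-in for the two SaveMode values relevant here.
  sealed trait Mode
  case object Ignore extends Mode
  case object ErrorIfExists extends Mode

  sealed trait Plan
  final case class V1CreateTable(name: String, tableType: TableType, mode: Mode) extends Plan
  final case class V2CreateTable(
      name: String,
      ifNotExists: Boolean,
      location: Option[String],
      partitioning: Seq[String]) extends Plan

  // Conversion happens after parsing, once the kind of table is known.
  def convert(stmt: CreateTableStatement, isV2: String => Boolean): Plan =
    if (isV2(stmt.provider)) {
      // v2: pass the parsed options through unchanged.
      V2CreateTable(stmt.name, stmt.ifNotExists, stmt.location, stmt.partitioning)
    } else {
      // v1: apply the data-source-table interpretations here, not in the parser.
      val tableType: TableType = if (stmt.location.isDefined) External else Managed
      val mode: Mode = if (stmt.ifNotExists) Ignore else ErrorIfExists
      V1CreateTable(stmt.name, tableType, mode)
    }
}
```

With this shape, a v2 conversion never has to reverse-engineer a SaveMode
or a table type back into the SQL that produced it.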

This has simplified the work I’ve been doing to add PARTITION BY
transformations. Instead of needing to add transformations to the
CatalogTable metadata that’s used everywhere, this only required a change
to the rule that converts from the parsed SQL plan to CatalogTable-based v1
plans. It is also cleaner to have the logic for converting to CatalogTable
in DataSourceAnalysis instead of in the parser itself.

Are there objections to this approach for integrating v2 plans?
-- 
Ryan Blue
Software Engineer
Netflix
