spark-issues mailing list archives

From "Ruslan Dautkhanov (Jira)" <>
Subject [jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
Date Wed, 04 Dec 2019 17:04:00 GMT


Ruslan Dautkhanov commented on SPARK-19842:

From the design document:


This alternative proposes to use the KEY_CONSTRAINTS catalog table when Spark upgrades to
Hive 2.1. Therefore, this proposal will introduce a dependency on Hive metastore 2.1. 


It seems Spark 3.0 is moving towards Hive 2.1, which has FK support. Would it be possible
to add FKs and the related optimizations to Spark 3.0 as well? 



> Informational Referential Integrity Constraints Support in Spark
> ----------------------------------------------------------------
>                 Key: SPARK-19842
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Ioana Delaney
>            Priority: Major
>         Attachments: InformationalRIConstraints.doc
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key (referential
integrity) constraints_ in Spark. The main purpose is to open up an area of query optimization
techniques that rely on referential integrity constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a _unique_, _primary
key_, _foreign key_, or _check constraint_, that can be used by Spark to improve query performance.
Informational constraints are not enforced by the Spark SQL engine; rather, they are used
by Catalyst to optimize query processing. They provide semantic information that allows
Catalyst to rewrite queries to eliminate joins, push down aggregates, remove unnecessary Distinct
operations, and perform a number of other optimizations. Informational constraints are primarily
targeted to applications that load and analyze data that originated from a data warehouse.
For such applications, the conditions for a given constraint are known to be true, so the
constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, constraint validation,
and maintenance. The document shows many examples of query performance improvements that utilize
referential integrity constraints and can be implemented in Spark.
> Link to the google doc: [InformationalRIConstraints|]
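For context, Hive 2.1 (HIVE-13076) already accepts informational constraint DDL of the kind the design document proposes to store in the metastore's KEY_CONSTRAINTS table. A rough sketch of what such declarations look like in Hive's syntax follows; the table and column names are illustrative, and Spark itself does not yet accept this DDL:

```sql
-- Declare an informational primary key on a dimension table.
-- DISABLE NOVALIDATE: the engine neither enforces nor validates the constraint;
-- RELY: the optimizer may still trust it when rewriting queries.
ALTER TABLE dim_store
  ADD CONSTRAINT pk_store PRIMARY KEY (store_id) DISABLE NOVALIDATE RELY;

-- Declare an informational foreign key from the fact table to the dimension.
ALTER TABLE fact_sales
  ADD CONSTRAINT fk_sales_store FOREIGN KEY (store_id)
  REFERENCES dim_store (store_id) DISABLE NOVALIDATE RELY;
```

With such RELY constraints recorded, an optimizer can, for example, drop a join to dim_store entirely when a query selects only fact_sales columns, since the FK guarantees each fact row matches exactly one dimension row.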

This message was sent by Atlassian Jira

