spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dongjoon Hyun (Jira)" <>
Subject [jira] [Updated] (SPARK-28481) More expressions should extend NullIntolerant
Date Mon, 16 Mar 2020 22:54:07 GMT


Dongjoon Hyun updated SPARK-28481:
    Affects Version/s:     (was: 3.0.0)

> More expressions should extend NullIntolerant
> ---------------------------------------------
>                 Key: SPARK-28481
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Josh Rosen
>            Priority: Major
> SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic for inferring
{{IsNotNull}} constraints from expressions. An expression is _null-intolerant_ if it returns
{{null}} when any of its input expressions are {{null}}.
> I've noticed that _most_ expressions are null-intolerant: anything which extends UnaryExpression
/ BinaryExpression and keeps the default {{eval}} method will be null-intolerant. However,
only a subset of these expressions mix in the {{NullIntolerant}} trait. As a result, we're
missing out on the opportunity to infer certain types of non-null constraints: for example,
if we see a {{WHERE length\(x\) > 10}} condition then we know that the column {{x}} must
be non-null and can push this non-null filter down to our datasource scan.
> I can think of a few ways to fix this:
>  # Modify every relevant expression to mix in the {{NullIntolerant}} trait. We can
use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus reflection) to help automate
the process of identifying expressions which do not override the default {{eval}}.
>  # Make a backwards-incompatible change to our abstract base class hierarchy to add {{NullSafe*aryExpression}} abstract
base classes which define the {{nullSafeEval}} method and implement a {{final eval}} method,
then leave {{eval}} unimplemented in the regular {{*aryExpression}} base classes.
>  ** This would fix the somewhat weird code smell that we have today where {{nullSafeEval}}
has a default implementation which calls {{sys.error}}.
>  ** This would negatively impact users who have implemented custom Catalyst expressions.
>  # Use runtime reflection to determine whether expressions are null-intolerant by virtue
of using one of the default null-intolerant {{eval}} implementations. We can then use this
in an {{isNullIntolerant}} helper method which checks that classes either (a) extend {{NullIntolerant}}
or (b) are null-intolerant according to the reflective check (which is basically just figuring
out which concrete implementation the {{eval}} method resolves to).
>  ** We only need to perform the reflection once _per-class_ and can cache the result
for the lifetime of the JVM, so the performance overheads would be pretty small (especially
compared to other non-cacheable reflection / traversal costs in Catalyst).
>  ** The downside is additional complexity in the code which pattern-matches / checks
for null-intolerance.
> Of these approaches, I'm currently leaning towards option 1 (semi-automated identification
and manual update of hundreds of expressions): if we go with that approach then we can perform
a one-time catch-up to fix all existing expressions. To handle ongoing maintenance (as we
add new expressions), I'd propose to add "is this null-intolerant?" to a checklist to use
when reviewing PRs which add new Catalyst expressions. 
> /cc [~maropu] [~viirya]

This message was sent by Atlassian Jira

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message