spark-issues mailing list archives

From "Aleksander Eskilson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
Date Fri, 14 Oct 2016 15:04:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575599#comment-15575599 ]

Aleksander Eskilson commented on SPARK-17939:
---------------------------------------------

[~marmbrus] suggested opening this issue after some discussion on the mailing list [1].

I'd like to clarify what I proposed in this newer context. First, a clarification of what
nullability means in the current API, as in the linked GitHub pull request [2], would be great.
Second, defaulting to nullable in the reader makes sense in many instances, since it supports
more loosely-typed data sources like JSON and CSV.
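
For concreteness, here is a minimal sketch of the default-nullable reader behavior I mean,
assuming a Spark 2.0 spark-shell session (where spark is the SparkSession); the input path
is hypothetical:

    // Schema inference for JSON marks every inferred field nullable by
    // default, which keeps the reader permissive about missing values.
    val df = spark.read.json("examples/people.json")  // hypothetical path
    df.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
    // typically prints nullable=true for each inferred field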

However, apart from an analysis-time hint to the Catalyst optimizer, there are instances where
a (potentially separate?) enforcement-level notion of nullability would be quite useful. Now
that writing custom encoders is possible, other kinds of more strongly-typed data might be
read into Datasets, e.g. Avro. Avro's UNION type with NULL gives us a harder notion of a truly
nullable vs. non-nullable type. It was suggested in the other linked JIRA issue [1] that the
current contract is for users to make sure they do not pass bad data into the reader (which
currently performs conversions that might surprise the user, like from null to 0).
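
To make that surprise concrete, here is a hedged sketch along the lines of SPARK-11319 [1];
the exact behavior (0 vs. null vs. an error) can vary by version and code path:

    import org.apache.spark.sql.types._

    // The user-supplied schema declares "age" non-nullable, but the
    // input contains a null. The read does not fail fast; the null has
    // been observed to surface as the primitive default (0) instead.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = false)))

    val rows = spark.sparkContext.parallelize(Seq("""{"name": "a", "age": null}"""))
    val df = spark.read.schema(schema).json(rows)
    df.show()  // may show age = 0 rather than raising an error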

What I mean to suggest is that a type-level notion of nullability could help us fail faster
and abide by our own data contracts when reading data into Datasets from more strongly-typed
sources with known schemas.
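
As a rough illustration of the fail-fast enforcement I have in mind (the helper name
assertNoNulls is hypothetical, not an existing Spark API):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Hypothetical helper: given a DataFrame read under a permissive
    // (all-nullable) schema and the columns the data contract requires
    // to be non-null, fail fast instead of letting nulls flow through.
    def assertNoNulls(df: DataFrame, requiredCols: Seq[String]): DataFrame = {
      for (name <- requiredCols) {
        val bad = df.filter(col(name).isNull).count()
        if (bad > 0)
          throw new IllegalArgumentException(
            s"Column '$name' must be non-null but has $bad null row(s)")
      }
      df
    }

Note that the check deliberately runs before any non-nullable schema hint is applied: once a
column is declared non-nullable, Catalyst is free to fold the isNull predicate to false, which
is exactly the optimization-vs-enforcement tension this issue is about.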

Thoughts on this?

> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> ------------------------------------------------------------------
>
>                 Key: SPARK-17939
>                 URL: https://issues.apache.org/jira/browse/SPARK-17939
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Aleksander Eskilson
>            Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets creates some
> confusion. As has been pointed out previously [1], Nullability is a hint to the Catalyst
> optimizer, and is not meant to be a type-level enforcement. Allowing null fields can also
> help the reader successfully parse certain types of more loosely-typed data, like JSON and
> CSV, where null values are common, rather than just failing.
> There's already been some movement to clarify the meaning of Nullable in the API, but
> also some requests for a (perhaps completely separate) type-level implementation of
> Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] - [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

