From "Gengliang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-26744) Support schema validation in File Source V2
Date Fri, 01 Feb 2019 04:59:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-26744:
-----------------------------------
    Description: 
The internal API supportDataType in FileFormat validates the input/output schema before task
execution starts, so that we can avoid launching read/write tasks that would fail and
users see clean error messages.

This PR implements the same internal API in the FileDataSourceV2 framework. Compared with
FileFormat, FileDataSourceV2 has multiple layers, so the validation is added in two places:

1. Read path: the table schema is determined in TableProvider.getTable. The actual read schema
can be a subset of the table schema. This PR proposes to validate the actual read schema in
FileScan.
2. Write path: validate the actual output schema in FileWriteBuilder.
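
The two validation points above can be illustrated with a small, framework-free sketch. This
is not Spark code: Field, UnsupportedSchemaError, validate_schema, and csv_supports are
hypothetical names, the set of supported types is invented for illustration, and the error
wording is only an example. It shows the read path checking just the pruned read schema,
while the write path checks the full output schema.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (name plus type name)."""
    name: str
    data_type: str


class UnsupportedSchemaError(Exception):
    """Raised before any task is launched when a type is not supported."""


def validate_schema(fields: List[Field],
                    support_data_type: Callable[[str], bool],
                    format_name: str) -> None:
    """Fail fast on the first field whose type the format cannot handle."""
    for field in fields:
        if not support_data_type(field.data_type):
            raise UnsupportedSchemaError(
                f"{format_name} data source does not support "
                f"{field.data_type} data type (column '{field.name}')."
            )


def csv_supports(data_type: str) -> bool:
    # Assumption: an invented rule for a CSV-like format, not Spark's real one.
    return data_type in {"int", "long", "double", "string", "boolean"}


table_schema = [
    Field("id", "int"),
    Field("name", "string"),
    Field("span", "interval"),
]

# Read path: the actual read schema may be a subset of the table schema,
# so only the columns that are really read get validated.
read_schema = [f for f in table_schema if f.name in {"id", "name"}]
validate_schema(read_schema, csv_supports, "CSV")  # passes: 'span' is pruned

# Write path: the whole output schema is validated before writing starts.
try:
    validate_schema(table_schema, csv_supports, "CSV")
    write_error = None
except UnsupportedSchemaError as err:
    write_error = str(err)
```

Validating the pruned read schema rather than the table schema means a query that never
touches an unsupported column is not rejected.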

  was:
The method supportDataType in FileFormat helps to validate the input/output schema before
execution starts, so that we can avoid some invalid data source IO and users can see clean
error messages.

This PR implements the same method in the FileDataSourceV2 framework. Compared with FileFormat,
FileDataSourceV2 has multiple layers. The API is added in two places:

1. FileWriteBuilder: this is where we can get the actual write schema
2. FileScan: this is where we can get the actual read schema.


> Support schema validation in File Source V2
> -------------------------------------------
>
>                 Key: SPARK-26744
>                 URL: https://issues.apache.org/jira/browse/SPARK-26744
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Gengliang Wang
>            Priority: Major




