spark-issues mailing list archives

From "Enrico Minack (Jira)" <>
Subject [jira] [Created] (SPARK-30319) Adds a stricter version of as[T]
Date Fri, 20 Dec 2019 15:24:00 GMT
Enrico Minack created SPARK-30319:

             Summary: Adds a stricter version of as[T]
                 Key: SPARK-30319
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: Enrico Minack
             Fix For: 3.0.0

The behaviour of as[T] is not intuitive when you read code like df.as[T].write.csv("data.csv").
The result depends on the actual schema of df, whereas def as[T](): Dataset[T] should be agnostic
to the schema of df. The behaviour a reader would expect is not provided by as[T], nor elsewhere:
 * Extra columns that are not part of the type {{T}} are not dropped.
 * Order of columns is not aligned with schema of {{T}}.
 * Columns are not cast to the types of {{T}}'s fields. They have to be cast explicitly.
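
The three points above can be reproduced with a small sketch. The case class {{Person}}, the object {{AsBehaviour}} and the sample data are hypothetical and serve only as illustration; the sketch assumes a local Spark session is available.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical target type, used only for illustration.
case class Person(id: Long, name: String)

object AsBehaviour {
  // Builds a small DataFrame whose schema differs from Person and applies as[Person].
  def example(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    // Column order differs from Person, "id" is an Int, and "age" is an extra column.
    val df = Seq(("Alice", 1, 30), ("Bob", 2, 25)).toDF("name", "id", "age")
    df.as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("as-behaviour").getOrCreate()
    val ds = example(spark)
    // The schema is still that of df: name, id, age, with id as an integer.
    ds.printSchema()
    spark.stop()
  }
}
```

Writing {{ds}} to CSV would therefore emit all three columns in the order of {{df}}, not the two fields of {{Person}}.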

A method that enforces the schema of T on a given Dataset would be very convenient. It would
let you articulate and guarantee the above assumptions about your data with the native Spark
Dataset API. Such a method plays a more explicit and enforcing role than as[T] with respect to
columns, column order and column type.

Possible naming of a stricter version of {{as[T]}}:
 * {{as[T](strict = true)}}
 * {{toDS[T]}} (as in {{toDF}})
 * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})

The naming {{toDS[T]}} is chosen here.
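
One way such a method could work is sketched below as an extension method. This is not the implementation proposed for Spark; {{StrictAs}}, {{StrictDataset}} and {{Person}} are hypothetical names, and the sketch assumes that selecting the encoder's fields by name and casting them to the encoder's types is sufficient (nested fields and column names containing dots are not handled).

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical target type, used only for illustration.
case class Person(id: Long, name: String)

object StrictAs {
  // Hypothetical extension method; not part of the Spark API.
  implicit class StrictDataset(ds: Dataset[_]) {
    def toDS[T: Encoder]: Dataset[T] = {
      // Select exactly the fields of T, in T's order, each cast to T's field type.
      val columns = implicitly[Encoder[T]].schema.fields
        .map(f => col(f.name).cast(f.dataType))
      ds.select(columns: _*).as[T]
    }
  }

  def example(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    // Column order differs from Person, "id" is an Int, and "age" is an extra column.
    val df = Seq(("Alice", 1, 30), ("Bob", 2, 25)).toDF("name", "id", "age")
    // The resulting schema has only id (long) and name (string), in that order.
    df.toDS[Person]
  }
}
```

Under these assumptions the extra column is dropped, the column order follows {{Person}}, and {{id}} is cast explicitly, so the written output no longer depends on the schema of {{df}}.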
