spark-issues mailing list archives

From "Nick Dimiduk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema
Date Mon, 20 Feb 2017 23:43:44 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875137#comment-15875137
] 

Nick Dimiduk commented on SPARK-19615:
--------------------------------------

Thanks for taking a look [~hyukjin.kwon]. These three bugs are indeed issues -- in all cases,
it seems Spark was not being careful to map column names to the appropriate column from each
side of the union. My experience with unions in 1.6.3 and 2.1.0 has been much better. That said,
I still see echoes of SPARK-9874 / SPARK-9813 when I extend one side or the other with null
columns. I can file that as a separate issue if that's of interest to you.

As for what RDBMS may or may not do, I'm not very aware or concerned. I'm thinking more about
ease of use for a user. This is why I suggest perhaps a different union method that would
encapsulate this behavior. Parsed Spark SQL can exhibit whatever semantics the community deems
appropriate, while still giving users of the API access to this convenient functionality.
I've implemented this logic in my application and it's quite complex. It would be very good
for Spark to provide this for its users.

> Provide Dataset union convenience for divergent schema
> ------------------------------------------------------
>
>                 Key: SPARK-19615
>                 URL: https://issues.apache.org/jira/browse/SPARK-19615
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nick Dimiduk
>
> Creating a union DataFrame over two sources that have different schema definitions is
surprisingly complex. Provide a version of the union method that will infer a target
schema as the result of merging the sources. Automatically extend either side with {{null}}
columns for any missing columns that are nullable.
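
The merge described above can be sketched in plain Python. This is only a model of the intended semantics, not Spark code and not the reporter's implementation: schemas are dicts mapping column name to (type name, nullable), rows are dicts, and the names merge_schemas and union_divergent are hypothetical. It also assumes that a type conflict, or a non-nullable column missing from one side, should raise an error; the issue does not say what should happen in those cases.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-ins for a DataFrame schema and its rows.
Schema = Dict[str, Tuple[str, bool]]  # column name -> (type name, nullable)
Row = Dict[str, Any]


def merge_schemas(left: Schema, right: Schema) -> Schema:
    """Merge two schemas by column name, keeping first-seen column order."""
    merged: Schema = {}
    for schema, other in ((left, right), (right, left)):
        for name, (dtype, nullable) in schema.items():
            if name in other:
                other_dtype, other_nullable = other[name]
                if other_dtype != dtype:
                    # Assumption: conflicting types are rejected, not coerced.
                    raise ValueError(f"column {name!r}: {dtype} vs {other_dtype}")
                merged.setdefault(name, (dtype, nullable or other_nullable))
            elif not nullable:
                # Assumption: a missing column can only be padded with nulls,
                # so a non-nullable column missing on one side is an error.
                raise ValueError(f"non-nullable column {name!r} is missing on one side")
            else:
                merged.setdefault(name, (dtype, True))
    return merged


def union_divergent(
    left_rows: List[Row], left_schema: Schema,
    right_rows: List[Row], right_schema: Schema,
) -> Tuple[Schema, List[Row]]:
    """Union two row sets, filling columns absent from a row with None."""
    target = merge_schemas(left_schema, right_schema)
    # Columns are matched by name, never by position.
    rows = [{name: row.get(name) for name in target} for row in left_rows + right_rows]
    return target, rows
```

Matching columns by name rather than by position is what avoids the column mix-ups mentioned in the comment above; the null padding only decides what a row holds for a column its source never had.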



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

