spark-issues mailing list archives

From "Skyler Lehan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26739) Standardized Join Types for DataFrames
Date Thu, 28 Feb 2019 14:51:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780594#comment-16780594
] 

Skyler Lehan commented on SPARK-26739:
--------------------------------------

[~Francis47] yes, I didn't realize org.apache.spark.sql.catalyst.plans.joinTypes existed, but that would be ideal!
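
For reference, the existing catalyst hierarchy is roughly of this shape (a simplified sketch, not the authoritative source; the actual definitions live in org.apache.spark.sql.catalyst.plans and cover more join types):

{code:java}
import java.util.Locale

// Simplified sketch of the sealed hierarchy in
// org.apache.spark.sql.catalyst.plans; the real source also defines
// semi, anti, cross, and other join types.
sealed abstract class JoinType { def sql: String }

case object Inner extends JoinType { override def sql: String = "INNER" }
case object LeftOuter extends JoinType { override def sql: String = "LEFT OUTER" }
case object RightOuter extends JoinType { override def sql: String = "RIGHT OUTER" }
case object FullOuter extends JoinType { override def sql: String = "FULL OUTER" }

object JoinType {
  // The real apply() handles more aliases and reports the supported
  // types on error; this captures only the general idea.
  def apply(typ: String): JoinType =
    typ.toLowerCase(Locale.ROOT).replace("_", "") match {
      case "inner" => Inner
      case "left" | "leftouter" => LeftOuter
      case "right" | "rightouter" => RightOuter
      case "outer" | "full" | "fullouter" => FullOuter
      case other => throw new IllegalArgumentException(s"Unsupported join type '$other'")
    }
}
{code}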

> Standardized Join Types for DataFrames
> --------------------------------------
>
>                 Key: SPARK-26739
>                 URL: https://issues.apache.org/jira/browse/SPARK-26739
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Skyler Lehan
>            Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
> Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join types are specified via a string parameter called joinType. To know which joins are possible, a developer must look up the API documentation for join. While this works, it leaves room for a typo, resulting in an improper join and/or unexpected errors that are not caught at compile time. The objective of this improvement is to give developers a common definition of the join types (an enum or constants) called JoinTypes. This would enumerate the possible joins and remove the possibility of a typo. It would also allow Spark to rename joins in the future without impacting end users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow; it would not solve anything beyond providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
> {code}
> Where they type the join type by hand as a string literal. As stated above, this:
>  * Requires developers to manually type in the join type
>  * Leaves room for typos
>  * Prevents renaming join types, since every call site hard-codes a string literal
>  * Does not restrict or compile-check the join type being used, leading to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be successful?
> The new approach would use constants or, *preferably, an enum*; something like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In-code references/definitions of the possible join types
>  ** This in turn allows scaladoc describing what each join type does and how it operates
>  * Removes the possibility of a typo in the join type
>  * Provides compile-time checking of the join type (only if an enum is used)
> To clarify: if JoinType is a set of constants, it would simply fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement.
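> A minimal sketch of what such an enum could look like in Scala (names are illustrative, not a committed API):
> {code:java}
> // Hypothetical enum: a sealed trait restricts the domain of join
> // types to exactly the values defined here.
> sealed trait JoinType { def joinName: String }
>
> object JoinType {
>   case object Inner extends JoinType { val joinName = "inner" }
>   case object LeftOuter extends JoinType { val joinName = "left_outer" }
>   case object RightOuter extends JoinType { val joinName = "right_outer" }
>   case object FullOuter extends JoinType { val joinName = "full_outer" }
> }
> {code}
> Because the trait is sealed, a misspelled value such as JoinType.LeftOutter simply does not compile, whereas a misspelled string only fails at runtime.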
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This change makes the join function easier to use and leads to fewer runtime errors, because join type validation moves to compile time. It also provides an in-code reference to the join types, saving developers from having to look up and navigate the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation describing how each join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants, where each constant would hold the exact same string as the literal used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join functions could be added that take these enums, and the join calls that take a string joinType parameter could be deprecated. This would give developers a chance to migrate to the new join types.
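> One possible migration shape, sketched with a toy stand-in for Dataset (all names hypothetical; the real signatures live on org.apache.spark.sql.Dataset):
> {code:java}
> sealed trait JoinType { def joinName: String }
> case object Inner extends JoinType { val joinName = "inner" }
> case object LeftOuter extends JoinType { val joinName = "left_outer" }
>
> object JoinType {
>   // Maps today's string literals onto the enum so old call sites keep working
>   def fromString(typ: String): JoinType = typ match {
>     case "inner" => Inner
>     case "left_outer" | "left" => LeftOuter
>     case other => throw new IllegalArgumentException(s"Unknown join type '$other'")
>   }
> }
>
> class ToyDataset {
>   // New overload: the enum restricts the domain at compile time
>   def join(right: ToyDataset, joinType: JoinType): String =
>     s"join(${joinType.joinName})"
>
>   // Old overload: deprecated but still functional, delegating to the new one
>   @deprecated("Use the JoinType overload", "future")
>   def join(right: ToyDataset, joinType: String): String =
>     join(right, JoinType.fromString(joinType))
> }
> {code}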
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> The mid-term exam would be the addition of a common definition of the join types, plus new join functions that take the join type enum/constant. The final exam would be passing tests that verify these new join functions, together with deprecation of the join functions that take a string for joinType.
> h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, if any.
Backward and forward compatibility must be taken into account.
> {color:#FF0000}*It is strongly recommended that enums, not string constants, be used.*{color} String constants are presented as a possible solution, but not the ideal one.
> *If enums are used (preferred):*
> The following join function signatures would be added to the Dataset API:
> {code:java}
> def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame
> {code}
> The following functions would be deprecated:
> {code:java}
> def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
> {code}
> A new enum called JoinType would be created. Developers would be encouraged to adopt JoinType instead of the literal strings.
> *If string constants are used:*
> No current API changes are required; however, a new Scala object with string constants would be defined like so:
> {code:java}
> object JoinType {
>   final val INNER: String = "inner"
>   final val LEFT_OUTER: String = "left_outer"
> }
> {code}
> This approach would not allow for compile time checking of the join types.
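> To make the trade-off concrete (sketch, using the constants proposed above): the constant merely substitutes the same string, so a raw, misspelled literal still compiles and only fails at runtime.
> {code:java}
> object JoinType {
>   final val INNER: String = "inner"
>   final val LEFT_OUTER: String = "left_outer"
> }
>
> // With constants, this is safe:
> //   leftDF.join(rightDF, cols, JoinType.LEFT_OUTER)
> // but nothing prevents a caller from still writing the misspelled
> //   leftDF.join(rightDF, cols, "letf_outer")
> // which compiles fine and fails only at runtime.
> {code}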



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

