spark-user mailing list archives

From Ali Tajeldin EDU <alitedu1...@gmail.com>
Subject Re: How to distinguish columns when joining DataFrames with shared parent?
Date Wed, 21 Oct 2015 18:17:44 GMT
Furthermore, adding an alias as the warning suggests doesn't seem to help either.
Here is a slight modification to the example below:

> scala> val largeValues = df.filter('value >= 10).as("lv")

And just looking at the join results:
> scala> val j = smallValues
>   .join(largeValues, smallValues("key") === largeValues("key"))

scala> j.select($"value").show
This will throw an exception indicating that "value" is ambiguous (to be expected).

scala> j.select(smallValues("value")).show
This will show the left (smallValues) "value" column, as expected.

scala> j.select(largeValues("value")).show
This will show the left (smallValues) "value" column (resolved to the wrong column).

scala> j.select(largeValues("lv.value")).show
This will show the left (smallValues) "value" column (resolved to the wrong column, even
though we explicitly specified the alias and used the right-hand DataFrame).

scala> j.select($"lv.value").show
This throws a "cannot resolve 'lv.value'" exception, so the "lv" alias is not preserved
in the join result.

Does anyone know the appropriate way to use aliases in DataFrame operations, or is this a bug?
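
For what it's worth, one workaround that appears to sidestep the ambiguity entirely is to
rename the columns on one side before joining, so that every column name in the join result
is unique (the "lv_" names below are just for illustration, not required by the API):

scala> val lv = largeValues.toDF("lv_key", "lv_value")
scala> val j2 = smallValues.join(lv, smallValues("key") === lv("lv_key"))
scala> j2.withColumn("diff", j2("value") - j2("lv_value")).show

Since no column name appears twice in j2, apply/select can no longer resolve to the wrong
side, and the "diff" column comes out correct. That said, it would still be good to know
whether the alias-based approach is supposed to work.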
--
Ali


On Oct 20, 2015, at 5:23 PM, Isabelle Phan <nliphan@gmail.com> wrote:

> Hello,
> 
> When joining two DataFrames that originate from the same initial DataFrame, why can't the
org.apache.spark.sql.DataFrame.apply(colName: String) method distinguish which column to read?
> 
> Let me illustrate this question with a simple example (ran on Spark 1.5.1):
> 
> //my initial DataFrame
> scala> df
> res39: org.apache.spark.sql.DataFrame = [key: int, value: int]
> 
> scala> df.show
> +---+-----+
> |key|value|
> +---+-----+
> |  1|    1|
> |  1|   10|
> |  2|    3|
> |  3|   20|
> |  3|    5|
> |  4|   10|
> +---+-----+
> 
> 
> //2 children DataFrames
> scala> val smallValues = df.filter('value < 10)
> smallValues: org.apache.spark.sql.DataFrame = [key: int, value: int]
> 
> scala> smallValues.show
> +---+-----+
> |key|value|
> +---+-----+
> |  1|    1|
> |  2|    3|
> |  3|    5|
> +---+-----+
> 
> 
> scala> val largeValues = df.filter('value >= 10)
> largeValues: org.apache.spark.sql.DataFrame = [key: int, value: int]
> 
> scala> largeValues.show
> +---+-----+
> |key|value|
> +---+-----+
> |  1|   10|
> |  3|   20|
> |  4|   10|
> +---+-----+
> 
> 
> //Joining the children
> scala> smallValues
>   .join(largeValues, smallValues("key") === largeValues("key"))
>   .withColumn("diff", smallValues("value") - largeValues("value"))
>   .show
> 15/10/20 16:59:59 WARN Column: Constructing trivially true equals predicate, 'key#41
= key#41'. Perhaps you need to use aliases.
> +---+-----+---+-----+----+
> |key|value|key|value|diff|
> +---+-----+---+-----+----+
> |  1|    1|  1|   10|   0|
> |  3|    5|  3|   20|   0|
> +---+-----+---+-----+----+
> 
> 
> This last command issued a warning, but still executed the join correctly (rows with
keys 2 and 4 don't appear in the result set). However, the "diff" column is incorrect:
both operands resolved to the same column, so every row shows 0.
> 
> Is this a bug or am I missing something here?
> 
> 
> Thanks a lot for any input,
> 
> Isabelle

