spark-user mailing list archives

From Justin Yip <yipjus...@prediction.io>
Subject Re: Best practice to avoid ambiguous columns in DataFrame.join
Date Fri, 15 May 2015 23:24:08 GMT
Thanks Michael,

This is very helpful. I have a follow-up question related to NaFunctions.
Usually after a left outer join we get lots of null values, and we need to
handle them before further processing. In the following piece of code, the
"_1" column is duplicated and crashes the .na.fill function.

From your answer, it appears that Spark 1.4 resolves this issue, since only a
single "_1" column is output. Do you know if there is a good workaround for
Spark 1.3?

scala> df3.show
_1 a
1  a
2  b
3  b
4  b

scala> df4.show
_1 b
1  10
2  null
3  3
4  0

scala> df3.join(df4, df3("_1") === df4("_1")).na.fill(-999)
org.apache.spark.sql.AnalysisException: Reference '_1' is ambiguous, could
be: _1#33, _1#31.;
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:128)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
...
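
For 1.3, a possible workaround (rough sketch; the column names are taken from
the toy example above) seems to be renaming the right-hand key before the
join, so that na.fill never sees two "_1" columns:

scala> val df4r = df4.withColumnRenamed("_1", "right_key")

scala> val joined = df3.join(df4r, df3("_1") === df4r("right_key"), "left_outer")

scala> joined.na.fill(-999).select("_1", "a", "b").show()  // names are unique, so fill resolves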

Thanks!

Justin



On Fri, May 15, 2015 at 3:55 PM, Michael Armbrust <michael@databricks.com>
wrote:

> There are several ways to solve this ambiguity:
>
> *1. Use the DataFrames to get the attribute, so it's already "resolved" and
> not just a string we need to map to a DataFrame.*
>
> df.join(df2, df("_1") === df2("_1"))
>
> *2. Use aliases*
>
> df.as('a).join(df2.as('b), $"a._1" === $"b._1")
>
> *3. Rename the columns, as you suggested.*
>
> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" ===
> $"right_key").printSchema
>
> *4. (Spark 1.4 only) use def join(right: DataFrame, usingColumn: String):
> DataFrame*
>
> df.join(df2, "_1")
>
> This has the added benefit of only outputting a single _1 column.
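>
> For completeness, a quick sketch of options 1-3 that should paste into
> spark-shell (option 4 needs 1.4); the data just mirrors the toy example
> below and "right_key" is an arbitrary name:
>
> // assumes spark-shell, where sqlContext and its implicits ($-columns) are in scope
> val df  = sqlContext.createDataFrame(Seq((1, "a"), (2, "b")))
> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20)))
>
> // 1. join on resolved Columns taken from each DataFrame
> df.join(df2, df("_1") === df2("_1")).show()
>
> // 2. alias each side and qualify the join key
> df.as('a).join(df2.as('b), $"a._1" === $"b._1").show()
>
> // 3. rename the right-hand key before joining
> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").show()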
>
> On Fri, May 15, 2015 at 3:44 PM, Justin Yip <yipjustin@prediction.io>
> wrote:
>
>> Hello,
>>
>> I would like to know if there are recommended ways of preventing
>> ambiguous columns when joining DataFrames. When we join DataFrames, it
>> often happens that we join on a column with an identical name. I could rename
>> the columns on the right DataFrame, as described in the following code. Is
>> there a better way to achieve this?
>>
>> scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3,
>> "b"), (4, "b")))
>> df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>
>> scala> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20), (3,
>> 30), (4, 40)))
>> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
>>
>> scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" ===
>> $"right_key").printSchema
>>
>> Thanks.
>>
>> Justin
>>
>>
>
>
