It’s probably not advisable to use option 1, though, since it will break when `df = df2`,
which can easily happen when you’ve written a function that performs such a join internally.
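For concreteness, a minimal sketch of the self-join failure mode (assuming a spark-shell session with `sqlContext` in scope, as in the examples below):

```scala
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b")))

// Both sides are the same DataFrame object, so df("_1") === df("_1")
// compares a column with itself: Spark cannot tell which side of the
// join each reference belongs to, and the condition is trivially true.
// df.join(df, df("_1") === df("_1"))   // ambiguous / degenerate

// Aliasing first gives each side a distinct identity to resolve against:
val joined = df.as("l").join(df.as("r"), $"l._1" === $"r._1")
```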

This could be solved by an identity-like function that returns the DataFrame unchanged but with a different identity.
`.as` would be a natural candidate, but that doesn’t work.

Thoughts?

On 16 May 2015, at 00:55, Michael Armbrust <michael@databricks.com> wrote:

There are several ways to solve this ambiguity:

1. Use the DataFrames to get the attribute so it's already "resolved" and not just a string we need to map to a DataFrame.

df.join(df2, df("_1") === df2("_1"))

2. Use aliases

df.as('a).join(df2.as('b), $"a._1" === $"b._1")

3. Rename the columns, as you suggested.

df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema

4. (Spark 1.4 only) Use def join(right: DataFrame, usingColumn: String): DataFrame

df.join(df2, "_1")

This has the added benefit of only outputting a single _1 column.
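A hedged sketch of option 4 using the same `df`/`df2` shapes from the question below (requires Spark 1.4+; `sqlContext` is assumed to be in scope):

```scala
val df  = sqlContext.createDataFrame(Seq((1, "a"), (2, "b")))
val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20)))

// USING-style join on "_1": the join key is emitted once, so the
// result schema is [_1, _2 (string), _2 (int)] rather than carrying
// a duplicate, ambiguous _1 from each side.
df.join(df2, "_1").printSchema()
```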

On Fri, May 15, 2015 at 3:44 PM, Justin Yip <yipjustin@prediction.io> wrote:
Hello,

I would like to know if there are recommended ways of preventing ambiguous columns when joining DataFrames. When we join DataFrames, it often happens that we join on columns with identical names. I could rename the columns on the right DataFrame, as in the following code. Is there a better way to achieve this?

scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), (4, "b")))
df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20), (3, 30), (4, 40)))
df2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]

scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema

Thanks.

Justin


View this message in context: Best practice to avoid ambiguous columns in DataFrame.join
Sent from the Apache Spark User List mailing list archive at Nabble.com.