spark-user mailing list archives

From Jan-Paul Bultmann <janpaulbultm...@me.com>
Subject Re: Best practice to avoid ambiguous columns in DataFrame.join
Date Sun, 17 May 2015 16:31:32 GMT
It’s probably not advisable to use option 1, though, since it will break when `df = df2`,
which can easily happen when you’ve written a function that does such a join internally.
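
A minimal sketch of that failure mode (using the `df`/`df2` from the session quoted below; `keyJoin` is just an illustrative helper, not part of Spark):

import org.apache.spark.sql.DataFrame

// Option 1 from Michael's list, wrapped in a helper the way it often ends up in real code.
def keyJoin(a: DataFrame, b: DataFrame): DataFrame =
  a.join(b, a("_1") === b("_1"))

keyJoin(df, df2)  // fine: the two column references come from distinct plans
keyJoin(df, df)   // breaks: a("_1") and b("_1") are the same resolved column,
                  // so the condition degenerates into comparing a column with itself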

This could be solved by an identity-like function that returns the DataFrame unchanged but
with a different identity.
`.as` would be a candidate for this, but it doesn’t work.
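
What I have in mind is something like this (hypothetical, not in the Spark API):

// def freshCopy(df: DataFrame): DataFrame = ???  // same data, new identity

// `.as` looks like a candidate, but as far as I can tell, column lookups through
// the aliased DataFrame still resolve to the same underlying attributes:
val aliased = df.as('other)
df.join(aliased, df("_1") === aliased("_1"))  // still the self-join problem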

Thoughts?

> On 16 May 2015, at 00:55, Michael Armbrust <michael@databricks.com> wrote:
> 
> There are several ways to solve this ambiguity:
> 
> 1. Use the DataFrames to get the attribute so it's already "resolved" and not just a string
we need to map to a DataFrame.
> 
> df.join(df2, df("_1") === df2("_1"))
> 
> 2. Use aliases
> 
> df.as('a).join(df2.as('b), $"a._1" === $"b._1")
> 
> 3. Rename the columns, as you suggested.
> 
> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
> 
> 4. (Spark 1.4 only) Use def join(right: DataFrame, usingColumn: String): DataFrame
> 
> df.join(df2, "_1")
> 
> This has the added benefit of only outputting a single _1 column.
> 
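(As an aside, a quick sketch of that schema difference, using the df/df2 from the original question below; the exact output is from memory, so treat it as approximate:)

df.join(df2, df("_1") === df2("_1")).columns
// Array(_1, _2, _1, _2)  -- the join key shows up twice

df.join(df2, "_1").columns  // Spark 1.4 usingColumn variant
// Array(_1, _2, _2)       -- single shared key column
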
> On Fri, May 15, 2015 at 3:44 PM, Justin Yip <yipjustin@prediction.io> wrote:
> Hello,
> 
> I would like to know if there are recommended ways of preventing ambiguous columns when
joining DataFrames. When we join DataFrames, it often happens that we join on columns with
identical names. I could rename the columns on the right DataFrame, as described in the
following code. Is there a better way to achieve this?
> 
> scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), (4, "b")))
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
> 
> scala> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20), (3, 30), (4, 40)))
> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
> 
> scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
> 
> Thanks.
> 
> Justin
> 

