spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: Error joining dataframes
Date Wed, 18 May 2016 07:38:13 GMT
Ah, yes. `df_join` has two `id` columns, so you need to select which one you use;

scala> :paste

// Entering paste mode (ctrl-D to finish)


val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")

val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")

val df3 = df1.join(df2, df1("id") === df2("id"), "outer")

df3.printSchema

df3.select(df1("id")).show


// Exiting paste mode, now interpreting.


root

 |-- id: integer (nullable = true)

 |-- A: integer (nullable = true)

 |-- id: integer (nullable = true)

 |-- B: integer (nullable = true)


+----+

|  id|

+----+

|   1|

|   2|

|null|

+----+
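
If you'd rather collapse the two `id` columns into a single one (the original question in this thread), one sketch that should also work on v1.5 is to coalesce them in the select. This is untested here; it assumes `coalesce` from `org.apache.spark.sql.functions` and a `sqlContext` in scope, as in the examples below:

```scala
import org.apache.spark.sql.functions.coalesce

val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")

// Keep df1's id where it exists, fall back to df2's id on right-only rows
val merged = df1.join(df2, df1("id") === df2("id"), "outer")
  .select(coalesce(df1("id"), df2("id")).as("id"), df1("A"), df2("B"))

merged.registerTempTable("merged_test")
sqlContext.sql("select id from merged_test").show  // single id column, no ambiguity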


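Another possible workaround on v1.5 is to rename one side's join key before the join, so the registered temp table never has duplicate column names (again a sketch, not tested against your data):

```scala
val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")

// Rename df2's join key so the joined schema has unique column names
val df2r = df2.withColumnRenamed("id", "id2")
val joined = df1.join(df2r, df1("id") === df2r("id2"), "outer")

joined.registerTempTable("renamed_test")
sqlContext.sql("select id, id2 from renamed_test").show  // both columns resolvable
```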

On Wed, May 18, 2016 at 4:29 PM, ram kumar <ramkumarrock3@gmail.com> wrote:

> When you register a temp table from the dataframe
>
> eg:
> var df_join = df1.join(df2, df1("id") === df2("id"), "outer")
> df_join.registerTempTable("test")
>
> sqlContext.sql("select * from test")
>
> +----+----+----+----+
>
> |  id|   A|  id|   B|
>
> +----+----+----+----+
>
> |   1|   0|null|null|
>
> |   2|   0|   2|   0|
>
> |null|null|   3|   0|
>
> +----+----+----+----+
>
>
> but, when you query the "id"
>
>
> sqlContext.sql("select id from test")
>
> Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is
> ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>
> On Wed, May 18, 2016 at 12:44 PM, Takeshi Yamamuro <linguin.m.s@gmail.com>
> wrote:
>
>> Looks weird; spark-v1.5.x seems to accept the query.
>> What's the difference between the example and your query?
>>
>> ####
>>
>> Welcome to
>>
>>       ____              __
>>
>>      / __/__  ___ _____/ /__
>>
>>     _\ \/ _ \/ _ `/ __/  '_/
>>
>>    /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>>
>>       /_/
>>
>> scala> :paste
>>
>> // Entering paste mode (ctrl-D to finish)
>>
>> val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
>>
>> val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")
>>
>> df1.join(df2, df1("id") === df2("id"), "outer").show
>>
>>
>> // Exiting paste mode, now interpreting.
>>
>>
>> +----+----+----+----+
>>
>> |  id|   A|  id|   B|
>>
>> +----+----+----+----+
>>
>> |   1|   0|null|null|
>>
>> |   2|   0|   2|   0|
>>
>> |null|null|   3|   0|
>>
>> +----+----+----+----+
>>
>>
>> df1: org.apache.spark.sql.DataFrame = [id: int, A: int]
>>
>> df2: org.apache.spark.sql.DataFrame = [id: int, B: int]
>>
>>
>>
>>
>>
>> On Wed, May 18, 2016 at 3:52 PM, ram kumar <ramkumarrock3@gmail.com>
>> wrote:
>>
>>> I tried
>>> df1.join(df2, df1("id") === df2("id"), "outer").show
>>>
>>> But there is a duplicate "id" and when I query the "id", I get
>>> Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>>> ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>>>
>>> I am currently using Spark 1.5.2.
>>> Is there any alternative way in 1.5?
>>>
>>> Thanks
>>>
>>> On Wed, May 18, 2016 at 12:12 PM, Takeshi Yamamuro <
>>> linguin.m.s@gmail.com> wrote:
>>>
>>>> Also, you can use a query like this in spark-v1.6+;
>>>>
>>>> val df1 = Seq((1, 0), (2, 0), (3, 0)).toDF("id", "A")
>>>> val df2 = Seq((1, 0), (2, 0), (3, 0)).toDF("id", "B")
>>>> df1.join(df2, df1("id") === df2("id"), "outer").show
>>>>
>>>> // maropu
>>>>
>>>>
>>>> On Wed, May 18, 2016 at 3:29 PM, ram kumar <ramkumarrock3@gmail.com>
>>>> wrote:
>>>>
>>>>> If I run
>>>>> val rs = s.join(t,"time_id").join(c,"channel_id")
>>>>>
>>>>> it is treated as an inner join.
>>>>>
>>>>>
>>>>> On Wed, May 18, 2016 at 2:31 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Pretty simple: a similar construct, with the tables projected as DataFrames.
>>>>>>
>>>>>> val c =
>>>>>> HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
>>>>>> val t =
>>>>>> HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
>>>>>> val rs = s.join(t,"time_id").join(c,"channel_id")
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17 May 2016 at 21:52, Bijay Kumar Pathak <bkpathak@mtu.edu> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Try this one:
>>>>>>>
>>>>>>>
>>>>>>> df_join = df1.join(df2, 'Id', "fullouter")
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Bijay
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 17, 2016 at 9:39 AM, ram kumar <ramkumarrock3@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I tried to join two dataframe
>>>>>>>>
>>>>>>>> df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter")
>>>>>>>>
>>>>>>>> df_join.registerTempTable("join_test")
>>>>>>>>
>>>>>>>>
>>>>>>>> When querying "Id" from "join_test"
>>>>>>>>
>>>>>>>> 0: jdbc:hive2://> select Id from join_test;
>>>>>>>> Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>>>>>>>> ambiguous, could be: Id#128, Id#155.; line 1 pos 7
>>>>>>>> (state=,code=0)
>>>>>>>> 0: jdbc:hive2://>
>>>>>>>>
>>>>>>>> Is there a way to merge the values of df1("Id") and df2("Id")
>>>>>>>> into one "Id"?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro
