spark-user mailing list archives

From ram kumar <ramkumarro...@gmail.com>
Subject Re: Error joining dataframes
Date Wed, 18 May 2016 07:29:48 GMT
When you register a temp table from the joined dataframe, e.g.:
val df_join = df1.join(df2, df1("id") === df2("id"), "outer")
df_join.registerTempTable("test")

sqlContext.sql("select * from test")

+----+----+----+----+

|  id|   A|  id|   B|

+----+----+----+----+

|   1|   0|null|null|

|   2|   0|   2|   0|

|null|null|   3|   0|

+----+----+----+----+


but when you query the "id" column:


sqlContext.sql("select id from test")

Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is
ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
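
For reference, one workaround that should work on spark-1.5 is to merge the two join keys with coalesce() and select a single "id" column before registering the temp table. A minimal sketch, assuming the spark-shell's implicit sqlContext and the df1/df2 columns from the example above:

```scala
// Workaround sketch for Spark 1.5: collapse the duplicated join keys into
// one "id" column so that "select id from test" is no longer ambiguous.
import org.apache.spark.sql.functions.coalesce

val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")

val joined = df1.join(df2, df1("id") === df2("id"), "outer")

// Take df1's id where it is non-null, otherwise fall back to df2's.
val merged = joined.select(
  coalesce(df1("id"), df2("id")).as("id"),
  df1("A"),
  df2("B"))

merged.registerTempTable("test")
sqlContext.sql("select id from test").show()
```

Since the merged frame carries only one "id" attribute, the AnalysisException above no longer applies.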

On Wed, May 18, 2016 at 12:44 PM, Takeshi Yamamuro <linguin.m.s@gmail.com>
wrote:

> Looks weird; it seems spark-v1.5.x can accept the query.
> What's the difference between the example and your query?
>
> ####
>
> Welcome to
>
>       ____              __
>
>      / __/__  ___ _____/ /__
>
>     _\ \/ _ \/ _ `/ __/  '_/
>
>    /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>
>       /_/
>
> scala> :paste
>
> // Entering paste mode (ctrl-D to finish)
>
> val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
>
> val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")
>
> df1.join(df2, df1("id") === df2("id"), "outer").show
>
>
> // Exiting paste mode, now interpreting.
>
>
> +----+----+----+----+
>
> |  id|   A|  id|   B|
>
> +----+----+----+----+
>
> |   1|   0|null|null|
>
> |   2|   0|   2|   0|
>
> |null|null|   3|   0|
>
> +----+----+----+----+
>
>
> df1: org.apache.spark.sql.DataFrame = [id: int, A: int]
>
> df2: org.apache.spark.sql.DataFrame = [id: int, B: int]
>
>
>
>
>
> On Wed, May 18, 2016 at 3:52 PM, ram kumar <ramkumarrock3@gmail.com>
> wrote:
>
>> I tried
>> df1.join(df2, df1("id") === df2("id"), "outer").show
>>
>> But there is a duplicate "id" and when I query the "id", I get
>> Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>> ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>>
>> I am currently using spark 1.5.2.
>> Is there any alternative way in 1.5?
>>
>> Thanks
>>
>> On Wed, May 18, 2016 at 12:12 PM, Takeshi Yamamuro <linguin.m.s@gmail.com
>> > wrote:
>>
>>> Also, you can run the query you'd like to use in spark-v1.6+:
>>>
>>> val df1 = Seq((1, 0), (2, 0), (3, 0)).toDF("id", "A")
>>> val df2 = Seq((1, 0), (2, 0), (3, 0)).toDF("id", "B")
>>> df1.join(df2, df1("id") === df2("id"), "outer").show
>>>
>>> // maropu
>>>
>>>
>>> On Wed, May 18, 2016 at 3:29 PM, ram kumar <ramkumarrock3@gmail.com>
>>> wrote:
>>>
>>>> If I run as
>>>> val rs = s.join(t,"time_id").join(c,"channel_id")
>>>>
>>>> It takes as inner join.
>>>>
>>>>
>>>> On Wed, May 18, 2016 at 2:31 AM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> pretty simple, a similar construct to tables projected as DF
>>>>>
>>>>> val c =
>>>>> HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
>>>>> val t =
>>>>> HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
>>>>> val rs = s.join(t,"time_id").join(c,"channel_id")
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 17 May 2016 at 21:52, Bijay Kumar Pathak <bkpathak@mtu.edu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Try this one:
>>>>>>
>>>>>>
>>>>>> df_join = df1.join(df2, 'Id', "fullouter")
>>>>>>
>>>>>> Thanks,
>>>>>> Bijay
>>>>>>
>>>>>>
>>>>>> On Tue, May 17, 2016 at 9:39 AM, ram kumar <ramkumarrock3@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I tried to join two dataframe
>>>>>>>
>>>>>>> df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter")
>>>>>>>
>>>>>>> df_join.registerTempTable("join_test")
>>>>>>>
>>>>>>>
>>>>>>> When querying "Id" from "join_test"
>>>>>>>
>>>>>>> 0: jdbc:hive2://> select Id from join_test;
>>>>>>> Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>>>>>>> ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>>>>>>> 0: jdbc:hive2://>
>>>>>>>
>>>>>>> Is there a way to merge the value of df1("Id") and df2("Id") into
>>>>>>> one "Id"?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
