spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: dataframe left joins are not working as expected in pyspark
Date Sat, 27 Jun 2015 18:29:52 GMT
I would test it against 1.3 to be sure, because it could -- though unlikely
-- be a regression. For example, I recently stumbled upon this issue
<https://issues.apache.org/jira/browse/SPARK-8670> which was specific to
1.4.

On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl <axel@whisperstream.com> wrote:

> I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's
> code would be failing right now.
>
> On Saturday, June 27, 2015, Nicholas Chammas <nicholas.chammas@gmail.com>
> wrote:
>
>> Yeah, you shouldn't have to rename the columns before joining them.
>>
>> Do you see the same behavior on 1.3 vs 1.4?
>>
>> Nick
>> 2015년 6월 27일 (토) 오전 2:51, Axel Dahl <axel@whisperstream.com>님이
작성:
>>
>>> still feels like a bug to have to create unique names before a join.
>>>
>>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha <guha.ayan@gmail.com> wrote:
>>>
>>>> You can declare the schema with unique names before creation of df.
>>>> On 27 Jun 2015 13:01, "Axel Dahl" <axel@whisperstream.com> wrote:
>>>>
>>>>>
>>>>> I have the following code:
>>>>>
>>>>> from pyspark import SQLContext
>>>>>
>>>>> d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
>>>>> 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age':
3}]
>>>>> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
>>>>> {'name':'alice', 'country': 'ire', 'colour':'green'}]
>>>>>
>>>>> r1 = sc.parallelize(d1)
>>>>> r2 = sc.parallelize(d2)
>>>>>
>>>>> sqlContext = SQLContext(sc)
>>>>> df1 = sqlContext.createDataFrame(d1)
>>>>> df2 = sqlContext.createDataFrame(d2)
>>>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>>>>> 'left_outer').collect()
>>>>>
>>>>>
>>>>> When I run it I get the following, (notice in the first row, all join
>>>>> keys are take from the right-side and so are blanked out):
>>>>>
>>>>> [Row(age=2, country=None, name=None, colour=None, country=None,
>>>>> name=None),
>>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>>>> name=u'bob'),
>>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>> country=u'ire', name=u'alice')]
>>>>>
>>>>> I would expect to get (though ideally without duplicate columns):
>>>>> [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
>>>>> name=None),
>>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>>>> name=u'bob'),
>>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>> country=u'ire', name=u'alice')]
>>>>>
>>>>> The workaround for now is this rather clunky piece of code:
>>>>> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
>>>>> 'name2').withColumnRenamed('country', 'country2')
>>>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>>>>> 'left_outer').collect()
>>>>>
>>>>> So to me it looks like a bug, but am I doing something wrong?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Axel
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>

Mime
View raw message