spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Axel Dahl <a...@whisperstream.com>
Subject Re: dataframe left joins are not working as expected in pyspark
Date Sat, 27 Jun 2015 16:28:15 GMT
I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's
code would be failing right now.

On Saturday, June 27, 2015, Nicholas Chammas <nicholas.chammas@gmail.com>
wrote:

> Yeah, you shouldn't have to rename the columns before joining them.
>
> Do you see the same behavior on 1.3 vs 1.4?
>
> Nick
> 2015년 6월 27일 (토) 오전 2:51, Axel Dahl <axel@whisperstream.com
> <javascript:_e(%7B%7D,'cvml','axel@whisperstream.com');>>님이 작성:
>
>> still feels like a bug to have to create unique names before a join.
>>
>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha <guha.ayan@gmail.com
>> <javascript:_e(%7B%7D,'cvml','guha.ayan@gmail.com');>> wrote:
>>
>>> You can declare the schema with unique names before creation of df.
>>> On 27 Jun 2015 13:01, "Axel Dahl" <axel@whisperstream.com
>>> <javascript:_e(%7B%7D,'cvml','axel@whisperstream.com');>> wrote:
>>>
>>>>
>>>> I have the following code:
>>>>
>>>> from pyspark import SQLContext
>>>>
>>>> d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
>>>> 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
>>>> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
>>>> {'name':'alice', 'country': 'ire', 'colour':'green'}]
>>>>
>>>> r1 = sc.parallelize(d1)
>>>> r2 = sc.parallelize(d2)
>>>>
>>>> sqlContext = SQLContext(sc)
>>>> df1 = sqlContext.createDataFrame(d1)
>>>> df2 = sqlContext.createDataFrame(d2)
>>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>>>> 'left_outer').collect()
>>>>
>>>>
>>>> When I run it I get the following, (notice in the first row, all join
>>>> keys are take from the right-side and so are blanked out):
>>>>
>>>> [Row(age=2, country=None, name=None, colour=None, country=None,
>>>> name=None),
>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>>> name=u'bob'),
>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>> country=u'ire', name=u'alice')]
>>>>
>>>> I would expect to get (though ideally without duplicate columns):
>>>> [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
>>>> name=None),
>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>>> name=u'bob'),
>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>> country=u'ire', name=u'alice')]
>>>>
>>>> The workaround for now is this rather clunky piece of code:
>>>> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
>>>> 'name2').withColumnRenamed('country', 'country2')
>>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>>>> 'left_outer').collect()
>>>>
>>>> So to me it looks like a bug, but am I doing something wrong?
>>>>
>>>> Thanks,
>>>>
>>>> -Axel
>>>>
>>>>
>>>>
>>>>
>>>>
>>

Mime
View raw message