spark-user mailing list archives

From Michal Monselise
Subject Fwd: Join with multiple conditions (In reference to SPARK-7197)
Date Tue, 25 Aug 2015 18:21:05 GMT
Hello All,

PySpark currently has two ways of performing a join: specifying a join
condition or column names.

I would like to perform a join using a list of columns that appear in both
the left and right DataFrames. I have created an example in a question on
Stack Overflow.

Basically, I would like to do the following, as specified in the
documentation in /spark/python/pyspark/sql/ at line 560, and
specify a list of column names:

>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
However, this produces an error.

In JIRA issue SPARK-7197,
it is mentioned that the syntax is actually different from the one
specified in the documentation for joining using a condition.

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
The example given in the JIRA issue:

a.join(b, (a.year==b.year) & (a.month==b.month), 'inner')

In other words, the join function cannot take a list.
I was wondering if you could also clarify the correct syntax for
providing a list of columns.
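In the meantime, one way to approximate a list-of-columns join under the condition-based syntax is to fold the column names into a single expression. A minimal sketch, assuming DataFrames like `df` and `df3` above; the helper name `equi_join_cond` is mine, not part of the PySpark API:

```python
from functools import reduce
from operator import and_
from types import SimpleNamespace

def equi_join_cond(left, right, cols):
    """AND together a per-column equality test for every name in `cols`.

    `left` and `right` can be PySpark DataFrames (where `==` on columns
    yields a Column expression and `&` combines them), or any objects
    whose attributes support `==`.
    """
    return reduce(and_, [getattr(left, c) == getattr(right, c) for c in cols])

# Plain-Python illustration of the folding logic:
a = SimpleNamespace(name='Alice', age=2)
b = SimpleNamespace(name='Alice', age=2)
print(equi_join_cond(a, b, ['name', 'age']))  # True when all columns match

# Intended PySpark usage (hypothetical, mirroring the examples above):
# df.join(df3, equi_join_cond(df, df3, ['name', 'age']), 'outer')
```

This just rewrites the `(a.year==b.year) & (a.month==b.month)` pattern from the JIRA issue so it scales to any list of shared column names.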

