spark-user mailing list archives

From 夏俊鸾 <>
Subject About Join operator in PySpark
Date Wed, 12 Nov 2014 07:31:19 GMT
Hi all

    I have noticed that the "join" operator in PySpark is implemented with the
union and groupByKey operators instead of the cogroup operator. This change
will probably generate an extra shuffle stage, for example:

    rdd1 = sc.makeRDD(...).partitionBy(2)
    rdd2 = sc.makeRDD(...).partitionBy(2)
    rdd3 = rdd1.join(rdd2).collect()

    The code above generates 2 shuffles when implemented in Scala, but it
generates 3 shuffles with PySpark. What was the initial design motivation for
the join operator in PySpark? Any ideas for improving join performance in PySpark?
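To make the question concrete, here is a minimal pure-Python sketch of the union + groupByKey strategy described above, simulated on plain lists of (key, value) pairs rather than real RDDs. The function name and the tagging scheme are illustrative, not Spark's actual code; the point is that both inputs are tagged, unioned, and regrouped by key, which is where the extra shuffle comes from.

```python
from itertools import groupby
from operator import itemgetter

def join_via_union_groupbykey(rdd1, rdd2):
    """Sketch of a join built from union + groupByKey, on plain lists."""
    # Tag each pair with its source dataset, then "union" the two lists.
    tagged = [(k, (1, v)) for k, v in rdd1] + [(k, (2, w)) for k, w in rdd2]
    # "groupByKey": bring all tagged values for a key together.
    # In Spark this regrouping is the shuffle the email is asking about.
    tagged.sort(key=itemgetter(0))
    out = []
    for k, group in groupby(tagged, key=itemgetter(0)):
        tags = [t for _, t in group]
        vbuf = [v for tag, v in tags if tag == 1]  # values from rdd1
        wbuf = [w for tag, w in tags if tag == 2]  # values from rdd2
        # Emit the cross product of matching values, like an inner join.
        out.extend((k, (v, w)) for v in vbuf for w in wbuf)
    return out

# Example: only key "a" appears in both inputs, so only it survives.
pairs = join_via_union_groupbykey([("a", 1), ("b", 2)], [("a", 10)])
print(pairs)  # [('a', (1, 10))]
```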

