spark-user mailing list archives

From Dale Wang <>
Subject [Spark Dataset]: How to conduct co-partition join in the new Dataset API in Spark 2.0
Date Fri, 02 Dec 2016 05:23:31 GMT
Hi all,

In the old Spark RDD API, key-value pair RDDs can be co-partitioned to avoid
a shuffle, which gives us high join performance.
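
As a rough illustration of the idea, here is a minimal pure-Python sketch of
a co-partitioned join. It is not Spark code: the function names
partition_by_key and co_partitioned_join are hypothetical, and lists stand in
for partitions. It only shows why no data has to move between partitions when
both sides use the same partitioning function and the same partition count.

```python
from collections import defaultdict


def partition_by_key(pairs, num_partitions):
    """Hash-partition (key, value) pairs into num_partitions buckets."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def co_partitioned_join(left_parts, right_parts):
    """Inner-join two datasets that were partitioned the same way.

    Partition i of the left side is joined only with partition i of the
    right side, so no records cross partition boundaries (no "shuffle").
    """
    if len(left_parts) != len(right_parts):
        raise ValueError("both sides must have the same number of partitions")
    joined = []
    for left, right in zip(left_parts, right_parts):
        # Build a per-partition lookup table for the right side.
        index = defaultdict(list)
        for key, value in right:
            index[key].append(value)
        # Probe it with the left side of the same partition.
        for key, left_value in left:
            for right_value in index.get(key, []):
                joined.append((key, (left_value, right_value)))
    return joined


# Hypothetical sample data, partitioned identically on both sides.
users = partition_by_key([(1, "a"), (2, "b"), (3, "c")], 4)
orders = partition_by_key([(1, "x"), (3, "y"), (3, "z"), (4, "w")], 4)
result = sorted(co_partitioned_join(users, orders))
```

In the RDD API, the corresponding step is to call partitionBy with the same
partitioner on both pair RDDs before joining them.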

In the new Dataset API in Spark 2.0, is a high-performance, shuffle-free
join via co-partitioning still feasible? I have looked through the API docs
but could not find a way to do it. Will the Catalyst optimizer take
co-partitioning into account when it optimizes the query plan?

Thanks a lot to anyone who can offer a clue on this problem :-)

Zhaokang(Dale) Wang
