spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aniket Bhatnagar <aniket.bhatna...@gmail.com>
Subject Dataset API | Setting number of partitions during join/groupBy
Date Fri, 11 Nov 2016 17:22:14 GMT
Hi

I can't seem to find a way to pass number of partitions while join 2
Datasets or doing a groupBy operation on the Dataset. There is an option of
repartitioning the resultant Dataset but it's inefficient to repartition
after the Dataset has been joined/grouped into default number of
partitions. With RDD API, this was easy to do as the functions accepted a
numPartitions parameter. The only way to do this seems to be
sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>) but
this means that all join/groupBy operations going forward will have the
same number of partitions.

Thanks,
Aniket

Mime
View raw message