spark-user mailing list archives

From Jakub Dubovsky <spark.dubovsky.ja...@gmail.com>
Subject Number of partitions in Dataset aggregations
Date Wed, 01 Mar 2017 09:28:35 GMT
Hey all,

I know I can control the number of partitions used during Dataset
aggregations (groupBy, groupByKey, distinct, ...) via the
spark.sql.shuffle.partitions configuration.

Is there a specific reason why the Dataset API does not support passing
the number of partitions explicitly to each relevant call, the way the
RDD API does? That would be much more flexible. If an app contains
several aggregations (quite common) that differ in how computationally
expensive they are (also common), we are forced to use the larger
partition count for the trivial aggregations as well.
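To illustrate the asymmetry, here is a minimal configuration sketch (assuming a SparkSession named `spark` and small inline datasets; the values 200 and 8 are arbitrary examples): the RDD API accepts a per-call partition count, while the Dataset API only offers the global setting, which has to be flipped before each aggregation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// RDD API: the partition count is an explicit per-call argument.
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = rdd.reduceByKey(_ + _, 200) // 200 partitions for this shuffle only

// Dataset API: no per-call argument, so the global setting must be
// changed before each aggregation to get a comparable effect.
spark.conf.set("spark.sql.shuffle.partitions", "200")
val expensive = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()
  .groupByKey(_._1)
  .count() // shuffles into 200 partitions

spark.conf.set("spark.sql.shuffle.partitions", "8")
val trivial = Seq(("a", 1), ("a", 1)).toDS()
  .distinct() // shuffles into only 8 partitions
```

Mutating the session-wide setting between jobs works, but it is global state rather than a property of the individual aggregation, which is exactly the flexibility gap described above.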

Any thoughts or pointers to relevant design documents appreciated...

Thanks!

Jakub Dubovsky
