Hi Cheng, 

Are you saying that by setting up the lineage schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)
then Spark SQL will know that an SQL “group by” on Customer Code will not have to shuffle?

But the prepared will have already shuffled so we pay an upfront cost for future groupings (assuming we cache I suppose) 


On 20 Jan 2015, at 20:44, Cheng Lian <lian.cs.zju@gmail.com> wrote:

First of all, even if the underlying dataset is partitioned as expected, a shuffle can’t be avoided. Because Spark SQL knows nothing about the underlying data distribution. However, this does reduce network IO.

You can prepare your data like this (say CustomerCode is a string field with ordinal 1):

val schemaRdd = sql(...)
val schema = schemaRdd.schema
val prepared = schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)

n should be equal to spark.sql.shuffle.partitions.


On 1/19/15 7:44 AM, Mick Davies wrote:

Is it possible to use a HashPartioner or something similar to distribute a
SchemaRDDs data by the hash of a particular column or set of columns.

Having done this I would then hope that GROUP BY could avoid shuffle

E.g. set up a HashPartioner on CustomerCode field so that 

SELECT CustomerCode, SUM(Cost)
FROM Orders
GROUP BY CustomerCode

would not need to shuffle.


View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Using-HashPartitioner-to-distribute-by-column-tp21237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org