spark-user mailing list archives

From Mick Davies <Michael.BellDav...@gmail.com>
Subject [SQL] Using HashPartitioner to distribute by column
Date Mon, 19 Jan 2015 15:44:29 GMT
Is it possible to use a HashPartitioner, or something similar, to distribute a
SchemaRDD's data by the hash of a particular column or set of columns?

Having done this, I would then hope that GROUP BY could avoid a shuffle.

E.g. set up a HashPartitioner on the CustomerCode field so that

SELECT CustomerCode, SUM(Cost)
FROM Orders
GROUP BY CustomerCode

would not need to shuffle.
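
To make the idea concrete, here is a minimal sketch in plain Python of what I have in mind. It is not Spark code and does not use any Spark API: the sample rows and the helper names hash_partition and sum_by_key are made up for illustration, and CRC32 merely stands in for whatever hash a real partitioner would use. The point is only that once rows are placed by the hash of CustomerCode, each partition can be aggregated on its own.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical sample of the Orders table: (CustomerCode, Cost)
orders = [("A", 10.0), ("B", 5.0), ("A", 2.5), ("C", 7.0), ("B", 1.0)]

def hash_partition(rows, num_partitions):
    """Place each row in the partition chosen by the hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for code, cost in rows:
        # CRC32 is an assumption here, used as a deterministic stand-in hash.
        idx = crc32(code.encode("utf-8")) % num_partitions
        partitions[idx].append((code, cost))
    return partitions

def sum_by_key(rows):
    """Aggregate one partition locally, using no data from other partitions."""
    totals = defaultdict(float)
    for code, cost in rows:
        totals[code] += cost
    return dict(totals)

partitions = hash_partition(orders, num_partitions=3)
partials = [sum_by_key(p) for p in partitions]

# Every CustomerCode lives in exactly one partition, so the per-partition
# results can simply be combined; no cross-partition merge is needed.
result = {}
for partial in partials:
    result.update(partial)

print(result)
```

Because all rows for a given CustomerCode land in the same partition, the final combine step never has to add two partial sums for the same key, which is the property I am hoping the GROUP BY could exploit.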

Cheers 
Mick





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Using-HashPartitioner-to-distribute-by-column-tp21237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
