spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cheng Lian <lian.cs....@gmail.com>
Subject Re: [SQL] Using HashPartitioner to distribute by column
Date Tue, 20 Jan 2015 20:44:47 GMT
First of all, even if the underlying dataset is partitioned as expected, 
a shuffle can’t be avoided. Because Spark SQL knows nothing about the 
underlying data distribution. However, this does reduce network IO.

You can prepare your data like this (say |CustomerCode| is a string 
field with ordinal 1):

|val  schemaRdd  =  sql(...)
val  schema  =  schemaRdd.schema
val  prepared  =  schemaRdd.keyBy(_.getString(1)).partitionBy(new  HashPartitioner(n)).values.applySchema(schema)
|

|n| should be equal to |spark.sql.shuffle.partitions|.

Cheng

On 1/19/15 7:44 AM, Mick Davies wrote:

> Is it possible to use a HashPartioner or something similar to distribute a
> SchemaRDDs data by the hash of a particular column or set of columns.
>
> Having done this I would then hope that GROUP BY could avoid shuffle
>
> E.g. set up a HashPartioner on CustomerCode field so that
>
> SELECT CustomerCode, SUM(Cost)
> FROM Orders
> GROUP BY CustomerCode
>
> would not need to shuffle.
>
> Cheers
> Mick
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Using-HashPartitioner-to-distribute-by-column-tp21237.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
​

Mime
View raw message