spark-user mailing list archives

From Michael Davies <michael.belldav...@gmail.com>
Subject Re: [SQL] Using HashPartitioner to distribute by column
Date Wed, 21 Jan 2015 16:43:01 GMT
Hi Cheng, 

Are you saying that by setting up the lineage

    schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)

then Spark SQL will know that an SQL “group by” on CustomerCode will not have to shuffle?

But the prepared RDD will already have been shuffled, so we pay an upfront cost to benefit future groupings (assuming we cache it, I suppose).
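
To make the cost concrete, here is roughly the pipeline I have in mind (an untested sketch against the Spark 1.2-era SchemaRDD API, assuming an existing sqlContext, an already-registered Orders table, and an illustrative PreparedOrders name):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed pre-1.3)

    // Untested sketch: pre-shuffle once by CustomerCode, cache the result,
    // and reuse it for repeated GROUP BY queries.
    val n = 200  // should match spark.sql.shuffle.partitions
    val schemaRdd = sqlContext.sql("SELECT CustomerCode, Cost FROM Orders")
    val schema = schemaRdd.schema

    // CustomerCode is ordinal 0 in this projection.
    val prepared = sqlContext.applySchema(
      schemaRdd
        .keyBy(_.getString(0))
        .partitionBy(new HashPartitioner(n))
        .values,
      schema)

    prepared.cache()  // pay the shuffle cost once
    prepared.registerTempTable("PreparedOrders")

    // Later groupings hit the cached, pre-partitioned data.
    val totals = sqlContext.sql(
      "SELECT CustomerCode, SUM(Cost) FROM PreparedOrders GROUP BY CustomerCode")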

Mick

> On 20 Jan 2015, at 20:44, Cheng Lian <lian.cs.zju@gmail.com> wrote:
> 
> First of all, even if the underlying dataset is partitioned as expected, a shuffle can’t be avoided, because Spark SQL knows nothing about the underlying data distribution. However, pre-partitioning does reduce network IO.
> 
> You can prepare your data like this (say CustomerCode is a string field with ordinal 1):
> 
>     val schemaRdd = sql(...)
>     val schema = schemaRdd.schema
>     val prepared = schemaRdd
>       .keyBy(_.getString(1))
>       .partitionBy(new HashPartitioner(n))
>       .values
>       .applySchema(schema)
> 
> n should be equal to spark.sql.shuffle.partitions.
> 
> Cheng
> 
> On 1/19/15 7:44 AM, Mick Davies wrote:
> 
>> Is it possible to use a HashPartitioner or something similar to distribute a
>> SchemaRDD's data by the hash of a particular column or set of columns?
>> 
>> Having done this, I would then hope that GROUP BY could avoid a shuffle.
>> 
>> E.g. set up a HashPartitioner on the CustomerCode field so that 
>> 
>> SELECT CustomerCode, SUM(Cost)
>> FROM Orders
>> GROUP BY CustomerCode
>> 
>> would not need to shuffle.
>> 
>> Cheers 
>> Mick
>> 
>> 
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Using-HashPartitioner-to-distribute-by-column-tp21237.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>> 
>> 
> 
> 
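
For completeness, the HashPartitioner's partition count can be read from the SQL config rather than hard-coded, so it stays in sync with Cheng's advice above that n should equal spark.sql.shuffle.partitions. A small sketch, again assuming the Spark 1.2-era SQLContext (whose setConf/getConf methods manage SQL settings):

    import org.apache.spark.HashPartitioner

    // Keep the preparation shuffle width aligned with Spark SQL's own
    // shuffle parallelism (spark.sql.shuffle.partitions, default 200).
    sqlContext.setConf("spark.sql.shuffle.partitions", "48")
    val n = sqlContext.getConf("spark.sql.shuffle.partitions").toInt
    val partitioner = new HashPartitioner(n)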

