spark-user mailing list archives

From Patrick <titlibat...@gmail.com>
Subject Re: Shuffling on Dataframe to RDD conversion with a map transformation
Date Wed, 29 Mar 2017 02:24:59 GMT
Hi,

In the above query, the GROUP BY creates a shuffle stage, and by default
Spark SQL uses 200 partitions for that stage. This can be configured via
spark.sql.shuffle.partitions.

By increasing the number of shuffle partitions, I was able to run this.
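
For reference, a minimal sketch of raising that setting (Spark 1.x
HiveContext API, matching the code in the quoted mail below; the value
2000 is illustrative only, not a recommendation):

  // Raise the shuffle partition count before running the query
  hiveCtx.setConf("spark.sql.shuffle.partitions", "2000")

  // The same can be done in SQL:
  // hiveCtx.sql("SET spark.sql.shuffle.partitions=2000")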

Thanks,


On Thu, Feb 23, 2017 at 9:15 PM, Yong Zhang <java8964@hotmail.com> wrote:

> It would be helpful if you printed the execution plan for your query here.
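> (On the DataFrame below, that would be e.g. queryresult.explain(true),
> which prints both the logical and physical plans.)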
>
>
> Yong
>
>
> ------------------------------
> *From:* Patrick <titlibatali@gmail.com>
> *Sent:* Thursday, February 23, 2017 9:21 AM
> *To:* user
> *Subject:* Shuffling on Dataframe to RDD conversion with a map
> transformation
>
> Hi,
>
> I was wondering why there are two stages (shuffle write and shuffle read)
> at line no 194, where I convert the DataFrame obtained from the SQL query
> to an RDD. This causes the job to abort, and it doesn't scale to TBs of
> data.
>
> Also, I have set shuffle fraction=0.6 and memory fraction=0.2 while
> executing the job.
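>
> (Presumably the legacy memory settings spark.shuffle.memoryFraction and
> spark.storage.memoryFraction, i.e. something along the lines of:
>
>   spark-submit --conf spark.shuffle.memoryFraction=0.6 \
>                --conf spark.storage.memoryFraction=0.2 ...)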
>
>  val queryresult = hiveCtx.sql(
>    "select Name, Age, Product1, Product2, Product3, count(*), max(Month) " +
>    "from test " +
>    "GROUP BY Name, Age, Product1, Product2, Product3 " +
>    "GROUPING SETS ((Name, Age, Product1), (Name, Age, Product2), (Name, Age, Product3))")
>
> line 194:
>
> val resultrdd = queryresult.map(x =>
>   (x.get(0), (x.get(1), x.get(2), x.get(3), x.get(4), x.get(5))))
>
>
>
> [Inline image 2: screenshot not preserved in the archive]
>
> Any insights into the problem would be very helpful.
>
> Thanks
>
>
