spark-user mailing list archives

From Eugene Morozov <fathers...@list.ru>
Subject Re: grouping by a partitioned key
Date Tue, 11 Aug 2015 22:27:11 GMT
Philip,

If all the data for a key is inside just one partition, then Spark will figure that out. Can you guarantee that's the case?
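If it is guaranteed (for instance because you called partitionBy yourself), groupByKey can reuse the existing partitioner and doesn't plan a second shuffle. A minimal sketch, assuming a spark-shell session where sc is in scope; the keys, values and partition count below are made up:

import org.apache.spark.HashPartitioner

// Toy pair RDD; in practice this is your real keyed data.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))

// Explicitly co-locate all values for a key in a single partition.
val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()

// groupByKey picks up the existing HashPartitioner(8), so no second
// shuffle is planned and the result keeps the same partitioner.
val grouped = partitioned.groupByKey()
println(grouped.partitioner)   // Some(org.apache.spark.HashPartitioner@...)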
What is it you're trying to achieve? There might be another way to do it that would let you be 100% sure of what's happening.

You can print toDebugString (for an RDD) or call explain (for a DataFrame) to see what's happening under the hood.
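For example, assuming a spark-shell session; the path and column name are placeholders:

// RDD side: toDebugString prints the lineage; a ShuffledRDD in the output
// marks a stage boundary, i.e. data being moved.
import org.apache.spark.HashPartitioner
val grouped = sc.parallelize(Seq((1, "a"), (2, "b")))
  .partitionBy(new HashPartitioner(4))
  .groupByKey()
println(grouped.toDebugString)

// DataFrame side: explain() prints the physical plan; a shuffle shows up
// as an Exchange operator.
val df = sqlContext.read.parquet("/path/to/some/table")
df.groupBy("key").count().explain()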


On 12 Aug 2015, at 01:19, Philip Weaver <philip.weaver@gmail.com> wrote:

> If I have an RDD that happens to already be partitioned by a key, how efficient can I expect a groupBy operation to be? I would expect that Spark shouldn't have to move data around between nodes, and simply will have a small amount of work just checking the partitions to discover that it doesn't need to move anything around.
> 
> Now, what if we're talking about a Parquet database created using DataFrameWriter.partitionBy(...): will Spark SQL be smart when I group by a key that the data is already partitioned by?
> 
> - Philip
> 
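For the Parquet case in the quoted question, one way to see what Spark SQL actually does is to write a small table with DataFrameWriter.partitionBy, read it back, and inspect the plan of the grouping query; whether the Exchange disappears depends on the Spark version. A sketch with made-up paths and column names, again assuming a spark-shell session:

import sqlContext.implicits._

// Tiny DataFrame with a column the files will be partitioned by.
val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")

// Write one directory per distinct value of `key`.
df.write.partitionBy("key").parquet("/tmp/partitioned_table")

// Read it back, group by the partition column, and look at the plan:
// if an Exchange operator is still present, the query will shuffle
// regardless of the on-disk layout.
val readBack = sqlContext.read.parquet("/tmp/partitioned_table")
readBack.groupBy("key").count().explain()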

Eugene Morozov
fathersson@list.ru




