spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hemant Bhanawat <hemant9...@gmail.com>
Subject Re: How to minimize shuffling on Spark dataframe Join?
Date Wed, 12 Aug 2015 05:02:20 GMT
Is the source of your dataframe partitioned on key? As per your mail, it
looks like it is not. If that is the case,  for partitioning the data, you
will have to shuffle the data anyway.

Another part of your question is - how to co-group data from two dataframes
based on a key? I think for RDD's cogroup in PairRDDFunctions is a way. I
am not sure if something similar is available for DataFrames.

Hemant





On Tue, Aug 11, 2015 at 2:14 PM, Abdullah Anwar <
abdullah.ibn.anwar@gmail.com> wrote:

>
>
> I have two dataframes like this
>
>   student_rdf = (studentid, name, ...)
>   student_result_rdf = (studentid, gpa, ...)
>
> we need to join this two dataframes. we are now doing like this,
>
> student_rdf.join(student_result_rdf, student_result_rdf["studentid"] == student_rdf["studentid"])
>
> So it is simple. But it creates lots of data shuffling across worker
> nodes, but as joining key is similar and if the dataframe could (understand
> the partitionkey) be partitioned using that key (studentid) then there
> suppose not to be any shuffling at all. As similar data (based on partition
> key) would reside in similar node. is it possible, to hint spark to do this?
>
> So, I am finding the way to partition data based on a column while I read
> a dataframe from input. And If it is possible that Spark would understand
> that two partitionkey of two dataframes are similar, then how?
>
>
>
>
> --
> Abdullah
>

Mime
View raw message