spark-user mailing list archives

From Abdullah Anwar
Subject Fwd: How to minimize shuffling on Spark dataframe Join?
Date Tue, 11 Aug 2015 08:44:50 GMT
I have two DataFrames like this:

  student_rdf = (studentid, name, ...)
  student_result_rdf = (studentid, gpa, ...)

We need to join these two DataFrames. Currently we do it like this:

student_rdf.join(student_result_rdf, student_result_rdf["studentid"]
== student_rdf["studentid"])

This is simple, but it causes a lot of data shuffling across worker nodes.
Both DataFrames share the same join key, so if they could be partitioned on
that key (studentid), and Spark were aware of the partitioning, there should
be no shuffling at all, because rows with the same key would already reside
on the same node. Is it possible to hint Spark to do this?

So I am looking for a way to partition the data by a column when I read a
DataFrame from input. And is it possible to make Spark understand that the
two DataFrames are partitioned on the same key? If so, how?
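
To make the idea concrete, here is a minimal plain-Python sketch of what I
mean by co-partitioning. It does not use Spark at all. The sample rows, the
partition count and the helper names (partition_by_key, local_join) are made
up for illustration. It only shows that when both sides are split with the
same function of studentid, each partition can be joined locally without
looking at any other partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_by_key(rows, num_partitions=NUM_PARTITIONS):
    """Place each (studentid, value) row in the partition chosen by its key."""
    partitions = defaultdict(list)
    for studentid, value in rows:
        partitions[hash(studentid) % num_partitions].append((studentid, value))
    return partitions


def local_join(left_parts, right_parts):
    """Join partition i of the left side only with partition i of the right side."""
    joined = []
    for pid, left_rows in left_parts.items():
        # Build a lookup from the matching right-hand partition only.
        right_lookup = defaultdict(list)
        for studentid, gpa in right_parts.get(pid, []):
            right_lookup[studentid].append(gpa)
        for studentid, name in left_rows:
            for gpa in right_lookup[studentid]:
                joined.append((studentid, name, gpa))
    return sorted(joined)


# Made-up sample data standing in for student_rdf and student_result_rdf.
students = [(1, "Ann"), (2, "Bob"), (3, "Cy"), (7, "Dee")]
results = [(1, 3.9), (2, 3.1), (7, 3.5), (8, 2.8)]

student_parts = partition_by_key(students)
result_parts = partition_by_key(results)
joined = local_join(student_parts, result_parts)
print(joined)
```

Because both sides use the same partition function, a given studentid always
lands in the same partition number on each side, so no row ever has to move
to another partition for the join. That is the behaviour I would like Spark
to exploit.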

