spark-user mailing list archives

From dimitris plakas <dimitrisp...@gmail.com>
Subject Pyspark Partitioning
Date Thu, 04 Oct 2018 20:27:37 GMT
Hello everyone,

Here is an issue that I am facing when partitioning a dataframe.

I have a dataframe called data_df. It looks like this:

Group_Id | Object_Id | Trajectory
1        | obj1      | Traj1
2        | obj2      | Traj2
1        | obj3      | Traj3
3        | obj4      | Traj4
2        | obj5      | Traj5

This dataframe has 5045 rows. Each row has a Group_Id value from 1 to 7,
and the number of rows per Group_Id is arbitrary.
I want to split the RDD produced from this dataframe into 7 partitions,
one for each Group_Id, and then apply mapPartitions(), where I call the
function custom_func(). How can I create these partitions from this
dataframe? Should I first apply a group by (creating grouped_df) in order
to get a dataframe with 7 rows and then call
partitioned_rdd=grouped_df.rdd.mapPartitions()?
Which is the optimal way to do it?
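
To make the question concrete, here is a minimal sketch of one possible approach: key the RDD by Group_Id and use RDD.partitionBy() with a partition function that sends each group to its own partition. This is only an illustration and not necessarily the optimal way. The names NUM_GROUPS, group_partitioner and partition_by_group are hypothetical, and the body of custom_func() below is a stand-in that just counts rows, since the real function is not shown here.

```python
NUM_GROUPS = 7  # Group_Id takes the values 1..7, as described above


def group_partitioner(group_id):
    """Map Group_Id 1..7 to partition index 0..6 (hypothetical helper)."""
    return int(group_id) - 1


def custom_func(rows):
    """Stand-in for the real custom_func(): counts the rows of one partition.

    `rows` is an iterator over (Group_Id, Row) pairs. With the partitioner
    above, every pair in a partition shares the same Group_Id.
    """
    group_id = None
    count = 0
    for group_id, _row in rows:
        count += 1
    if group_id is not None:  # yield nothing for an empty partition
        yield (group_id, count)


def partition_by_group(data_df):
    """Needs a running Spark session; data_df is the dataframe shown above."""
    keyed_rdd = data_df.rdd.keyBy(lambda row: row["Group_Id"])
    # partitionBy() places each key in partition
    # group_partitioner(key) % NUM_GROUPS, so each group gets its own partition.
    partitioned_rdd = keyed_rdd.partitionBy(NUM_GROUPS, group_partitioner)
    return partitioned_rdd.mapPartitions(custom_func)
```

With this sketch, partition_by_group(data_df).collect() would return one (Group_Id, row count) pair per non-empty group. Unlike the group by idea, no 7-row dataframe is built first, and custom_func() still sees the individual rows of each group.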

Thank you in advance
