spark-user mailing list archives

From Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
Subject Re: Pyspark Partitioning
Date Thu, 04 Oct 2018 20:36:13 GMT
groupBy is an operator you would use if you wanted to *aggregate* the
values that are grouped by the specified key.

In your case you want to retain access to the individual values.

You need to partition the RDD by the key (rdd.partitionBy) and then you
can map the partitions. Of course you need to be careful of potential
skew in the resulting partitions: each partition holds an entire group,
so uneven group sizes mean uneven partitions.
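
Something like the following untested sketch (data_df, Group_Id and
custom_func are the names from your mail; the key-to-partition lambda is
my assumption, and it only works because your keys are exactly 1..7):

    # Key each row by Group_Id, then send each key to its own partition
    # (keys 1..7 map to partitions 0..6).
    keyed = data_df.rdd.map(lambda row: (row["Group_Id"], row))
    partitioned = keyed.partitionBy(7, lambda key: key - 1)

    # Each element is now a (Group_Id, Row) pair; mapPartitions feeds one
    # partition's elements to custom_func as an iterator.
    result = partitioned.mapPartitions(custom_func)

Note that a plain data_df.repartition(7, "Group_Id") hash-partitions the
keys, so two groups can collide into the same partition; the explicit
partitioner above guarantees one group per partition.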

On Thu, Oct 4, 2018, 23:27 dimitris plakas <dimitrispl22@gmail.com> wrote:

> Hello everyone,
>
> Here is an issue that I am facing when partitioning a dataframe.
>
> I have a dataframe called data_df. It looks like:
>
> Group_Id | Object_Id | Trajectory
>        1 |      obj1 |      Traj1
>        2 |      obj2 |      Traj2
>        1 |      obj3 |      Traj3
>        3 |      obj4 |      Traj4
>        2 |      obj5 |      Traj5
>
> This dataframe has 5045 rows, where each row has a Group_Id value from 1
> to 7, and the number of rows per group_id is arbitrary.
> I want to split the RDD produced from this dataframe into 7 partitions,
> one for each group_id, and then apply mapPartitions(), where I call the
> function custom_func(). How can I create these partitions from this
> dataframe? Should I first apply a group by (creating grouped_df) in order
> to get a dataframe with 7 rows and then call
> partitioned_rdd=grouped_df.rdd.mapPartitions()?
> Which is the optimal way to do it?
>
> Thank you in advance
>
