spark-user mailing list archives

From Sun Rui <sunrise_...@163.com>
Subject Re: How to partition a SparkDataFrame using all distinct column values in sparkR
Date Wed, 03 Aug 2016 10:04:58 GMT
SparkDataFrame.repartition() uses hash partitioning. It guarantees that all rows with the
same column value go to the same partition, but it does not guarantee that each partition
contains only a single column value, because distinct values can hash to the same partition.
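
Even if you set numPartitions equal to the number of distinct values, hash collisions mean
some partitions end up holding several values while others stay empty. A quick illustrative
sketch in plain R (not SparkR; the key count of 394 is taken from the question below):

    # Simulate hashing 394 distinct keys into 394 buckets.
    set.seed(42)
    hashes <- sample.int(.Machine$integer.max, 394)   # stand-in hash codes
    buckets <- hashes %% 394                          # bucket index for each key
    keys_per_bucket <- tabulate(buckets + 1, nbins = 394)
    table(keys_per_bucket)   # many buckets get 0 keys, some get 2 or more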

Fortunately, Spark 2.0 comes with gapply() in SparkR. You can group the SparkDataFrame by
the column and apply an R function to each group.
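
A minimal sketch of how gapply() can be used here, assuming hypothetical columns "key"
(the grouping column) and "value":

    library(SparkR)
    sparkR.session()

    df <- createDataFrame(data.frame(key = c("a", "a", "b"),
                                     value = c(1, 2, 3)))

    # The function receives the grouping key and a local R data.frame that
    # holds every row for that key, so each group is processed in isolation.
    schema <- structType(structField("key", "string"),
                         structField("total", "double"))
    result <- gapply(df, "key",
                     function(key, x) {
                       data.frame(key = key[[1]], total = sum(x$value),
                                  stringsAsFactors = FALSE)
                     },
                     schema)
    head(collect(result))   # one output row per distinct key

Note that the R function runs on the executors, so each group has to fit in the memory of
a single worker.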

> On Jul 26, 2016, at 06:46, Neil Chang <iambbq@gmail.com> wrote:
> 
> Hi,
>   This is a question regarding SparkR in spark 2.0.
> 
> Given that I have a SparkDataFrame, I want to partition it using one column's values:
> each value corresponds to a partition, and all rows having the same column value shall
> go to the same partition, no more, no less.
> 
>    It seems the function repartition() doesn't do this: I have 394 unique values, but it
> just partitions my DataFrame into 200 partitions. If I specify numPartitions as 394, there
> is still a mismatch between the values and the partitions.
> 
> Is it possible to do what I described in SparkR?
> groupBy() doesn't work with a UDF at all.
> 
> Or can we split the DataFrame into a list of small ones first? If so, what can I use?
> 
> Thanks,
> Neil


