spark-dev mailing list archives

From 汪洋 <tiandiwo...@icloud.com>
Subject Re: rdd.distinct with Partitioner
Date Thu, 09 Jun 2016 04:22:21 GMT
Hi Alexander,

I don't think the result is guaranteed to be correct if an arbitrary Partitioner is passed in.

I have created a notebook that you can check out: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
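
To illustrate one way this can go wrong, here is a minimal sketch in plain Scala (no Spark dependency), which is not necessarily the case the notebook shows. It assumes the failure comes from a partitioner whose getPartition is not a deterministic function of the key. The names SimplePartitioner, HashLike, RoundRobin and distinctWith are hypothetical and exist only for this illustration; they simulate the map / reduceByKey / map pipeline, in which records are merged only if they land in the same partition.

```scala
object DistinctSketch {
  // Hypothetical stand-in for a partitioner; not Spark's Partitioner class.
  trait SimplePartitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Deterministic: equal keys always map to the same partition.
  class HashLike(val numPartitions: Int) extends SimplePartitioner {
    def getPartition(key: Any): Int = {
      val m = key.hashCode % numPartitions
      if (m < 0) m + numPartitions else m
    }
  }

  // Not a function of the key: consecutive calls cycle through the partitions.
  class RoundRobin(val numPartitions: Int) extends SimplePartitioner {
    private var next = -1
    def getPartition(key: Any): Int = {
      next = (next + 1) % numPartitions
      next
    }
  }

  // Simulates map(x => (x, null)).reduceByKey(p, (x, y) => x).map(_._1).
  def distinctWith[T](data: Seq[T], p: SimplePartitioner): Seq[T] = {
    // "Shuffle": route each (element, null) pair to the partition chosen by p.
    val partitions = data.map(x => (x, null)).groupBy { case (k, _) => p.getPartition(k) }
    // "Reduce": keep one record per key, but only within each partition.
    partitions.values.flatMap(_.map(_._1).distinct).toSeq
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 1, 2, 2, 3, 3)
    println(distinctWith(data, new HashLike(2)).size)   // 3: duplicates removed
    println(distinctWith(data, new RoundRobin(2)).size) // 6: duplicates survive
  }
}
```

With the round-robin partitioner, the two copies of each value are sent to different partitions, so no reduce step ever sees both and the duplicates remain in the output.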

Best regards,

Yang


> On June 9, 2016, at 11:42 AM, Alexander Pivovarov <apivovarov@gmail.com> wrote:
> 
> Most of the RDD methods that shuffle data take a Partitioner as a parameter.
> 
> But rdd.distinct does not have such a signature.
> 
> Should I open a PR for that?
> 
> /**
>  * Return a new RDD containing the distinct elements in this RDD.
>  */
> def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>   map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
> }

