spark-user mailing list archives

From Hao REN <julien19890...@gmail.com>
Subject Re: RDD.subtract doesn't work
Date Fri, 13 Sep 2013 15:31:59 GMT
@Jason: Thank you for your code. It works fine.

@Mark: It's good to know about random number generation. Thanks for the
advice.

One more question:

Since subtract can be replaced by Jason's code, what is the use case for
subtract, given that it is not a good way to partition data?

Thank you.

Hao


On Fri, Sep 13, 2013 at 3:33 AM, Jason Lenderman <jslenderman@gmail.com> wrote:

>
> Yeah, I realized shortly after I sent that message that my use of map in
> that code was problematic. This is probably a bit better:
>
>
>   // Randomly splits data in two; each element lands in the first RDD
>   // with probability p.
>   def split[T : ClassManifest](data: RDD[T], p: Double,
>       seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
>     val rand = new java.util.Random(seed)
>     // One fixed seed per partition, so recomputing temp reproduces the
>     // same draws.
>     val partitionSeeds = data.partitions.map(partition => rand.nextLong)
>     val temp = data.mapPartitionsWithIndex((index, iter) => {
>       val partitionRand = new java.util.Random(partitionSeeds(index))
>       iter.map(x => (x, partitionRand.nextDouble))
>     })
>     (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
>   }
>
>
>
>
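
A plain-Scala sketch of why the per-partition seeds in the quoted split
matter. It assumes no Spark at all: an RDD is imitated by a sequence of
partitions, and the names SplitSketch, Partitions and tag are hypothetical,
introduced only for this illustration. The tagging step is deliberately run
twice, the way the two filters would each trigger a recomputation of temp;
because every partition gets a fixed seed, both passes see the same draws and
the two halves stay complementary.

```scala
import java.util.Random

object SplitSketch {
  // Hypothetical stand-in for an RDD: a sequence of partitions.
  type Partitions[T] = Seq[Seq[T]]

  // Tag every element with a random draw, one seeded generator per partition.
  // Calling this twice yields identical draws, mimicking a recomputed RDD.
  def tag[T](data: Partitions[T], partitionSeeds: Seq[Long]): Partitions[(T, Double)] =
    data.zipWithIndex.map { case (part, index) =>
      val partitionRand = new Random(partitionSeeds(index))
      part.map(x => (x, partitionRand.nextDouble()))
    }

  def split[T](data: Partitions[T], p: Double, seed: Long): (Seq[T], Seq[T]) = {
    val rand = new Random(seed)
    val partitionSeeds = data.map(_ => rand.nextLong())
    // Two separate evaluations, one per filter, as in the Spark version.
    val left = tag(data, partitionSeeds).flatten.filter(_._2 <= p).map(_._1)
    val right = tag(data, partitionSeeds).flatten.filter(_._2 > p).map(_._1)
    (left, right)
  }

  def main(args: Array[String]): Unit = {
    val data: Partitions[Int] = Seq(1 to 5, 6 to 10, 11 to 15)
    val (a, b) = split(data, 0.7, seed = 42L)
    // The halves are disjoint and together cover the input.
    assert((a ++ b).sorted == data.flatten.sorted)
    assert(a.intersect(b).isEmpty)
    println(s"${a.size} + ${b.size} = ${data.flatten.size}")
  }
}
```

If tag drew from an unseeded generator instead, the two passes would disagree
and an element could end up in both halves or in neither.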


-- 
REN Hao

Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)

Computer Science

Student at the Université de Technologie de Compiègne (UTC)

Computer Engineering - Data Mining

Tel:  +33 06 14 54 57 24  /  +41 07 86 47 52 69
