spark-user mailing list archives

From Jason Lenderman <jslender...@gmail.com>
Subject Re: RDD.subtract doesn't work
Date Fri, 13 Sep 2013 16:24:01 GMT
You're welcome. Be sure to use the second version I posted; the first
version is problematic and could produce a bad (non-random) split under
some circumstances.
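
For illustration, here is a minimal sketch of the per-partition seeding idea
from the quoted code below, written against plain Scala collections rather
than Spark. The names SplitSketch, Partitions, and tag are hypothetical, and
a Seq[Seq[T]] merely stands in for an RDD's partitions. It assumes the
seeding matters because an uncached RDD is recomputed by each filter, so both
filters need to see the same random draws for the two halves to be
complementary.

```scala
import java.util.Random

object SplitSketch {
  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  type Partitions[T] = Seq[Seq[T]]

  // Pair every element with a random draw from an RNG seeded per partition.
  def tag[T](data: Partitions[T], seeds: Seq[Long]): Seq[(T, Double)] =
    data.zipWithIndex.flatMap { case (part, index) =>
      val partitionRand = new Random(seeds(index))
      part.map(x => (x, partitionRand.nextDouble()))
    }

  def split[T](data: Partitions[T], p: Double, seed: Long): (Seq[T], Seq[T]) = {
    val rand = new Random(seed)
    // One seed per partition, derived from the master seed.
    val seeds = data.map(_ => rand.nextLong())
    // tag is deliberately evaluated twice to mimic an uncached RDD being
    // recomputed by each filter; fixed seeds make both passes identical.
    val left = tag(data, seeds).filter(_._2 <= p).map(_._1)
    val right = tag(data, seeds).filter(_._2 > p).map(_._1)
    (left, right)
  }
}
```

Calling SplitSketch.split(Seq(List.range(1, 51), List.range(51, 101)), 0.7, 42L)
should give two sequences that are disjoint and together contain every input
element, and the same seed should give the same split on every call.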


On Fri, Sep 13, 2013 at 8:31 AM, Hao REN <julien19890118@gmail.com> wrote:

> @Jason: Thank you for your code. It works fine.
>
> @Mark: It's good to know about random number generation. Thanks for the
> advice.
>
> One more question:
>
> Since subtract can be replaced by Jason's code, what is the use case for
> subtract, given that it is not a good way to partition data?
>
> Thank you.
>
> Hao
>
>
> On Fri, Sep 13, 2013 at 3:33 AM, Jason Lenderman <jslenderman@gmail.com> wrote:
>
>>
>> Yeah, I realized shortly after I sent that message that my use of map in
>> that code was problematic. This is probably a bit better:
>>
>>
>>   // Randomly split `data`: each element lands in the first RDD with
>>   // probability p, otherwise in the second.
>>   def split[T : ClassManifest](data: RDD[T], p: Double,
>>       seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
>>     val rand = new java.util.Random(seed)
>>     // One seed per partition, drawn on the driver from the master seed.
>>     val partitionSeeds = data.partitions.map(partition => rand.nextLong)
>>     val temp = data.mapPartitionsWithIndex((index, iter) => {
>>       // Each partition builds its own RNG from its own seed.
>>       val partitionRand = new java.util.Random(partitionSeeds(index))
>>       iter.map(x => (x, partitionRand.nextDouble))
>>     })
>>     (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
>>   }
>>
>>
>>
>>
>
>
> --
> REN Hao
>
> Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
>
> Computer Science
>
> Student at the Université de Technologie de Compiègne (UTC)
>
> Computer Engineering - Data Mining
>
> Tel:  +33 06 14 54 57 24  /  +41 07 86 47 52 69
>
