spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: how to set random seed
Date Sun, 17 May 2015 08:12:35 GMT
The Python workers used for each stage may be different, so this may not
work as expected.

You can create a Random object, set its seed, and use it to do the shuffle():

import random

r = random.Random()
r.seed(my_seed)

def f(x):
    r.shuffle(x)
    return x
rdd.map(f)
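A minimal, Spark-free sketch of the idea above (the seed value `MY_SEED` and the factory `make_shuffler` are hypothetical names, not part of any Spark API). Re-seeding a dedicated `random.Random` instance per call makes each element's shuffle reproducible regardless of which worker happens to process it:

```python
import random

MY_SEED = 42  # hypothetical fixed seed, chosen for reproducible test runs

def make_shuffler(seed):
    # Build the function you would pass to rdd.map(). It uses its own
    # Random instance, so it never touches the global random module's
    # hidden state, and it re-seeds per call so the result for a given
    # element does not depend on worker assignment or element order.
    def f(x):
        r = random.Random(seed)
        r.shuffle(x)
        return x
    return f

shuffle_fn = make_shuffler(MY_SEED)
a = shuffle_fn(list(range(10)))
b = shuffle_fn(list(range(10)))
assert a == b  # same seed, same shuffle, run after run
```

The per-call re-seeding is the key difference from seeding once per worker: it trades a little overhead for determinism that survives Spark re-scheduling tasks onto different Python workers between stages.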

On Thu, May 14, 2015 at 6:21 AM, Charles Hayden
<charles.hayden@atigeo.com> wrote:
> Thanks for the reply.
>
>
> I have not tried it out (I will today and report on my results) but I think
> what I need to do is to call mapPartitions and pass it a function that sets
> the seed.  I was planning to pass the seed value in the closure.
>
>
> Something like:
>
> my_seed = 42
> def f(iterator):
>     random.seed(my_seed)
>     for x in iterator:
>         yield x
> rdd.mapPartitions(f)
>
>
> ________________________________
> From: ayan guha <guha.ayan@gmail.com>
> Sent: Thursday, May 14, 2015 2:29 AM
>
> To: Charles Hayden
> Cc: user
> Subject: Re: how to set random seed
>
> Sorry for the late reply.
>
> Here is what I was thinking....
>
> import random as r
> def main():
>     # ... get SparkContext
>     # Just for fun, let's assume seed is an id
>     filename = "bin.dat"
>     seed = id(filename)
>     # broadcast it
>     br = sc.broadcast(seed)
>
>     # set up a dummy list
>     lst = []
>     for i in range(4):
>         x = []
>         for j in range(4):
>             x.append(j)
>         lst.append(x)
>     print lst
>     base = sc.parallelize(lst)
>     print base.map(randomize).collect()
>
> randomize looks like:
> def randomize(lst):
>     local_seed = br.value
>     r.seed(local_seed)
>     r.shuffle(lst)
>     return lst
>
>
> Let me know if this helps...
>
>
>
>
>
> On Wed, May 13, 2015 at 11:41 PM, Charles Hayden <charles.hayden@atigeo.com>
> wrote:
>>
>> Can you elaborate? Broadcast will distribute the seed, which is only one
>> number.  But what construct do I use to "plant" the seed (call
>> random.seed()) once on each worker?
>>
>> ________________________________
>> From: ayan guha <guha.ayan@gmail.com>
>> Sent: Tuesday, May 12, 2015 11:17 PM
>> To: Charles Hayden
>> Cc: user
>> Subject: Re: how to set random seed
>>
>>
>> Easiest way is to broadcast it.
>>
>> On 13 May 2015 10:40, "Charles Hayden" <charles.hayden@atigeo.com> wrote:
>>>
>>> In pySpark, I am writing a map with a lambda that calls random.shuffle.
>>> For testing, I want to be able to give it a seed, so that successive runs
>>> will produce the same shuffle.
>>> I am looking for a way to set this same random seed once on each worker.
>>> Is there any simple way to do it?
>>>
>>>
>
>
>
> --
> Best Regards,
> Ayan Guha
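To summarize the mapPartitions/broadcast ideas discussed above, here is a Spark-free sketch (no cluster at hand, so partitions are simulated as plain Python lists; `seeded_shuffle_partition` is a hypothetical helper name). It seeds a local `Random` once per partition and then shuffles each element, which is the per-partition analogue of broadcasting the seed:

```python
import random

def seeded_shuffle_partition(seed):
    # Returns a function shaped like what rdd.mapPartitions expects:
    # it takes an iterator of elements (each element is a list) and
    # yields shuffled copies. The seed would come from a broadcast
    # variable in a real Spark job.
    def f(iterator):
        r = random.Random(seed)  # local instance, seeded once per partition
        for lst in iterator:
            out = list(lst)      # shuffle a copy, leave the input intact
            r.shuffle(out)
            yield out
    return f

# Simulate two partitions as plain Python lists of elements.
partitions = [[[0, 1, 2, 3], [0, 1, 2, 3]], [[0, 1, 2, 3]]]
run1 = [list(seeded_shuffle_partition(42)(iter(p))) for p in partitions]
run2 = [list(seeded_shuffle_partition(42)(iter(p))) for p in partitions]
assert run1 == run2  # reproducible across runs for a fixed seed
```

One caveat worth noting: because the `Random` stream advances with each element, successive elements inside a partition get different shuffles, so the overall result is only reproducible if the partitioning and the element order within each partition are themselves stable between runs.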

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

