spark-user mailing list archives

From Charles Hayden <charles.hay...@atigeo.com>
Subject Re: how to set random seed
Date Thu, 14 May 2015 13:21:11 GMT
Thanks for the reply.


I have not tried it out (I will today and report on my results) but I think what I need to
do is to call mapPartitions and pass it a function that sets the seed.  I was planning to
pass the seed value in the closure.


Something like:

import random

my_seed = 42
def f(iterator):
    # Seed once per partition, then pass the rows through unchanged.
    random.seed(my_seed)
    for x in iterator:
        yield x
rdd.mapPartitions(f)
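The idea can be checked without a cluster. Below is a minimal pure-Python sketch: `seed_and_shuffle` is a hypothetical stand-in for the function passed to `mapPartitions` (Spark would call it once per partition with an iterator over that partition's rows), and the list of lists stands in for one partition.

```python
import random

my_seed = 42

def seed_and_shuffle(iterator):
    # Seed once at the start of the "partition", then shuffle each row.
    random.seed(my_seed)
    for row in iterator:
        shuffled = list(row)
        random.shuffle(shuffled)
        yield shuffled

# Simulate two separate runs over the same "partition" of rows.
partition = [[0, 1, 2, 3], [0, 1, 2, 3]]
run1 = list(seed_and_shuffle(iter(partition)))
run2 = list(seed_and_shuffle(iter(partition)))
```

Because the seed is set before any shuffling, `run1` and `run2` come out identical, which is the reproducibility we want for testing.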


________________________________
From: ayan guha <guha.ayan@gmail.com>
Sent: Thursday, May 14, 2015 2:29 AM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed

Sorry for late reply.

Here is what I was thinking....

import random as r

def main():
    # ... get SparkContext as sc ...
    # Just for fun, let's assume the seed is an id
    filename = "bin.dat"
    seed = id(filename)
    # broadcast it
    br = sc.broadcast(seed)

    # set up a dummy list of lists
    lst = []
    for i in range(4):
        x = []
        for j in range(4):
            x.append(j)
        lst.append(x)
    print lst
    base = sc.parallelize(lst)
    print base.map(randomize).collect()

randomize looks like:
def randomize(lst):
    local_seed = br.value
    r.seed(local_seed)
    r.shuffle(lst)
    return lst


Let me know if this helps...





On Wed, May 13, 2015 at 11:41 PM, Charles Hayden <charles.hayden@atigeo.com>
wrote:

Can you elaborate? Broadcast will distribute the seed, which is only one number. But what
construct do I use to "plant" the seed (call random.seed()) once on each worker?

________________________________
From: ayan guha <guha.ayan@gmail.com>
Sent: Tuesday, May 12, 2015 11:17 PM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed


Easiest way is to broadcast it.

On 13 May 2015 10:40, "Charles Hayden" <charles.hayden@atigeo.com>
wrote:
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs will produce the
same shuffle.
I am looking for a way to set this same random seed once on each worker. Is there any simple
way to do it?




--
Best Regards,
Ayan Guha
