spark-user mailing list archives

From Peter Rudenko <petro.rude...@gmail.com>
Subject Re: Dataframe random permutation?
Date Mon, 01 Jun 2015 20:53:16 GMT
Hi Cesar,
try this:

    hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema)

It's a bit inefficient, but it should shuffle the whole DataFrame.
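A minimal sketch of the suggestion above, next to one alternative, in case it helps. The object and method names (ShuffleSketch, shuffleByCoalesce, shuffleByRand) are hypothetical and only for illustration; hc and df are assumed to be an existing SQLContext (or HiveContext) and DataFrame. The second method sorts on a random column using rand from org.apache.spark.sql.functions, which assumes Spark 1.4 or later. As far as I know, coalesce with shuffle = true redistributes rows across partitions rather than drawing a uniformly random order, so the sort-based variant may be closer to a true permutation, at the cost of a full sort.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.rand

// Hypothetical helper object, not part of Spark.
object ShuffleSketch {

  // Redistribute rows across numPartitions partitions with a full shuffle,
  // then rebuild a DataFrame with the original schema.
  def shuffleByCoalesce(hc: SQLContext, df: DataFrame, numPartitions: Int): DataFrame =
    hc.createDataFrame(df.rdd.coalesce(numPartitions, shuffle = true), df.schema)

  // Alternative (assumption: Spark 1.4+): order rows by a random value
  // generated from the given seed. This triggers a full sort.
  def shuffleByRand(df: DataFrame, seed: Long): DataFrame =
    df.orderBy(rand(seed))
}
```

Usage would look like ShuffleSketch.shuffleByRand(df, 42L).show(100). Note that a fixed seed should only reproduce the same order when the input partitioning is also the same.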

Thanks,
Peter Rudenko
On 2015-06-01 22:49, Cesar Flores wrote:
>
> I would like to know what would be the best approach to randomly
> permute a DataFrame. I have tried:
>
> df.sample(false,1.0,x).show(100)
>
> where x is the seed. However, it gives the same result no matter the
> value of x (it only gives different values when the fraction is
> smaller than 1.0). I have also tried:
>
> hc.createDataFrame(df.rdd.repartition(100),df.schema)
>
> which appears to be a random permutation. Can someone confirm that
> the last line is in fact a random permutation, or point me to a
> better approach?
>
>
> Thanks!!!!
> -- 
> Cesar Flores

