# spark-user mailing list archives

From Alexey Grishchenko <programme...@gmail.com>
Subject Re: Any quick method to sample rdd based on one field?
Date Fri, 28 Aug 2015 21:53:26 GMT
```
In my opinion aggregate+flatMap would work faster, as it would make fewer
passes through the data. It would work like this:

import random

def agg(x, y):
    # y is a (key, flag) element; x[0] counts falses, x[1] counts trues
    x[0] += 1 if not y[1] else 0
    x[1] += 1 if y[1] else 0
    return x

# Source data
rdd  = sc.parallelize(xrange(100000), 5)
rdd2 = rdd.map(lambda x: (x, random.choice([True, False]))).cache()

# counts = [number of falses, number of trues]
counts = rdd2.aggregate([0, 0], agg, lambda x, y: [x[0] + y[0], x[1] + y[1]])
# If falses exceed 10% of trues, filtering is needed
if counts[0] * 10 > counts[1]:
    # Calculate sampling ratio that leaves roughly trues/10 falses
    prob0 = float(counts[1]) / 10.0 / float(counts[0])
    # Keep every true; keep each false with probability prob0
    rdd2 = rdd2.flatMap(lambda x: [x] if (x[1] or random.random() < prob0)
                        else [])

# Count trues and falses again for validation - falses should be ~10% of trues
rdd2.aggregate([0, 0], agg, lambda x, y: [x[0] + y[0], x[1] + y[1]])

On Fri, Aug 28, 2015 at 6:39 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:

> Filter into true rdd and false rdd. Union true rdd and sample of false rdd.
> On Aug 28, 2015 2:57 AM, "Gavin Yue" <yue.yuanyuan@gmail.com> wrote:
>
>> Hey,
>>
>>
>> I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and
>> randomly keep some Boolean:false rows.  And hope in the final result, the
>> negative ones could be 10 times more than positive ones.
>>
>>
>> What would be most efficient way to do this?
>>
>> Thanks,
>>
>>
>>
>>

--
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: ProgrammerAG@gmail.com
web:   http://0x0fff.com

```
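For readers unfamiliar with `RDD.aggregate`: it folds each partition with the seqOp and then merges the per-partition accumulators with the combOp. Below is a minimal plain-Python sketch of that contract (not part of the original message, and no Spark required), using the same `agg` counter as above; the partitioning here is simulated with list slices:

```python
import random
from functools import reduce

def agg(x, y):
    # y is a (key, flag) element; x[0] counts falses, x[1] counts trues
    x[0] += 1 if not y[1] else 0
    x[1] += 1 if y[1] else 0
    return x

def aggregate(partitions, zero, seq_op, comb_op):
    # Fold each partition starting from its own copy of the zero value,
    # then merge the per-partition accumulators -- this mirrors what
    # RDD.aggregate does across a cluster
    accs = [reduce(seq_op, part, list(zero)) for part in partitions]
    return reduce(comb_op, accs, list(zero))

random.seed(42)
data = [(i, random.choice([True, False])) for i in range(1000)]
partitions = [data[i::5] for i in range(5)]  # emulate 5 partitions

counts = aggregate(partitions, [0, 0],
                   agg,
                   lambda x, y: [x[0] + y[0], x[1] + y[1]])
print(counts)  # [number of falses, number of trues]
```

Note the fresh `list(zero)` copy per partition: `agg` mutates its accumulator in place, so sharing one zero value across partitions would double-count.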
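Sonal's two-pass suggestion (filter into a true rdd and a false rdd, then union the true rdd with a sample of the false rdd) can be sketched as follows. This is an illustration added by the editor, written over a plain Python list so the arithmetic is easy to check; the RDD equivalents (`filter`, `sample`, `union`) are noted in the comments, and the 5%-true toy dataset is an assumption:

```python
import random

random.seed(0)

# Toy stand-in for the RDD[(String, Boolean)]: mostly falses, 5% trues
data = [("row%d" % i, i % 20 == 0) for i in range(100000)]

trues = [x for x in data if x[1]]        # rdd.filter(lambda x: x[1])
falses = [x for x in data if not x[1]]   # rdd.filter(lambda x: not x[1])

# Sampling fraction that leaves roughly trues/10 falses,
# matching the "falses should be 10% of trues" target above
prob0 = float(len(trues)) / 10.0 / float(len(falses))

# falseRdd.sample(False, prob0) -- Bernoulli sampling without replacement
sampled = [x for x in falses if random.random() < prob0]

result = trues + sampled                 # trueRdd.union(sampledFalses)
```

Compared with the aggregate+flatMap version, this reads two filtered views of the data instead of one, but it expresses the intent directly with built-in RDD operations.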