spark-user mailing list archives

From Chengi Liu
Subject sampling in spark
Date Tue, 28 Oct 2014 07:26:32 GMT
I have three RDDs: X, y and p.
X is a matrix RDD (m x n), y is an (m x 1) vector,
and p is an (m x 1) probability vector.
I am trying to sample k rows from X, together with the corresponding entries in y,
according to the probability vector p.
Here is the Python implementation:

import random
from bisect import bisect
from operator import itemgetter

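def sample(population, k, prob):

    def cdf(population, k, prob):
        # Sort (prob, item) pairs by probability so each item keeps its own weight.
        pairs = sorted(zip(prob, population), key=itemgetter(0))
        # Running totals of the sorted probabilities (the CDF).
        cumm = [pairs[0][0]]
        for i in range(1, len(pairs)):
            cumm.append(cumm[-1] + pairs[i][0])
        # Clamp the index in case rounding leaves cumm[-1] slightly below 1.
        last = len(pairs) - 1
        return [pairs[min(bisect(cumm, random.random()), last)][1] for i in range(k)]

    return cdf(population, k, prob)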
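
To make the goal concrete, here is a minimal local (non-Spark) sketch of the behaviour described above: draw k row indices with replacement according to p, then use the same indices for X and y so the rows stay paired. The helper name sample_rows, the seed parameter and the toy data are illustrative assumptions, not part of the original code, and random.choices needs Python 3.6 or later.

```python
import random


def sample_rows(X, y, p, k, seed=None):
    """Draw k row indices with replacement, weighted by p, and return the
    matching rows of X and entries of y (hypothetical helper)."""
    if not (len(X) == len(y) == len(p)):
        raise ValueError("X, y and p must have the same number of rows")
    rng = random.Random(seed)
    # random.choices treats p as relative weights, so p need not sum to exactly 1.
    idx = rng.choices(range(len(p)), weights=p, k=k)
    # Index X and y with the same draws so each row keeps its own label.
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data (assumed for illustration): 4 rows, 2 columns.
# The third row has probability 0 and should never be drawn.
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [10, 20, 30, 40]
p = [0.1, 0.6, 0.0, 0.3]

Xs, ys = sample_rows(X, y, p, k=5, seed=0)
```

Sampling indices rather than rows is what keeps X and y aligned; the open question is how to do the same across three separate RDDs.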
