spark-user mailing list archives

From rok
Subject calculating the mean of SparseVector RDD
Date Wed, 07 Jan 2015 17:42:34 GMT
I have an RDD of SparseVectors and I'd like to calculate their element-wise
mean, returned as a dense vector. I've tried doing this with the following
(using pyspark, Spark v1.2.0):

import numpy as np

def aggregate_partition_values(vec1, vec2):
    # vec1: dense accumulator (numpy array); vec2: a SparseVector from the RDD
    vec1[vec2.indices] += vec2.values
    return vec1

def aggregate_combined_vectors(vec1, vec2):
    if all(vec1 == vec2):
        # then the vector came from only one partition
        return vec1
    else:
        return vec1 + vec2

means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values,
                       aggregate_combined_vectors)
means = means / nvals

This turns out to be really slow, and the run time doesn't seem to depend on
how many vectors there are, so there seems to be some overhead somewhere that
I'm not understanding. Is there a better way of doing this?
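
For checking the aggregation logic without a cluster, here is a minimal
local sketch in plain NumPy. It is not Spark code and says nothing about where
the overhead comes from. FakeSparseVector and local_mean are hypothetical names
standing in for the SparseVector class and for RDD.aggregate, and the combine
step adds the partial sums unconditionally, which is an assumption rather than
the equality check used above.

```python
import numpy as np
from functools import reduce


class FakeSparseVector:
    """Hypothetical stand-in exposing only .size, .indices and .values."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = np.asarray(indices, dtype=np.int64)
        self.values = np.asarray(values, dtype=np.float64)


def seq_op(acc, vec):
    # Add one sparse vector's nonzero entries into the dense accumulator.
    # np.add.at accumulates correctly even if an index appears twice.
    np.add.at(acc, vec.indices, vec.values)
    return acc


def comb_op(acc1, acc2):
    # Merge two per-partition dense sums (assumption: always add).
    return acc1 + acc2


def local_mean(partitions, vec_len):
    # Fold each "partition" (a plain list here) from its own zero vector,
    # then merge the partial sums and divide by the number of vectors.
    partials = [reduce(seq_op, part, np.zeros(vec_len)) for part in partitions]
    total = reduce(comb_op, partials, np.zeros(vec_len))
    nvals = sum(len(part) for part in partitions)
    return total / nvals


# Two made-up partitions holding three vectors of length 4 in total.
partitions = [
    [FakeSparseVector(4, [0, 2], [1.0, 3.0]), FakeSparseVector(4, [1], [2.0])],
    [FakeSparseVector(4, [0, 3], [5.0, 4.0])],
]
means = local_mean(partitions, vec_len=4)
print(means)
```

The same seq_op / comb_op pair has the shape that aggregate expects (a zero
value, a per-element function, and a merge function), so the expected result
for a small input can be compared against what the cluster returns.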
