colStats() computes the mean along with several other summary
statistics in a single pass, which makes it slower than a mean-only
aggregate. How is the performance if you don't use Kryo? Xiangrui
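
For reference, a minimal sketch of the colStats route from PySpark (it
assumes an RDD of vectors named "vecs"; the variable names are
illustrative):

    from pyspark.mllib.stat import Statistics

    # vecs: an RDD of vectors, assumed to exist already.
    # One pass over the data computes mean, variance, count, max, min,
    # and numNonzeros together -- that bundled work is the extra cost.
    summary = Statistics.colStats(vecs)
    means = summary.mean()  # NumPy array of per-column means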
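
On the Kryo buffer overflow quoted below: that error typically means a
single object did not fit in the Kryo serialization buffer. A sketch of
the settings that control it (property names as of Spark 1.2; the sizes
are illustrative, not tuned):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
            # initial Kryo buffer size and the ceiling it may grow to (MB)
            .set("spark.kryoserializer.buffer.mb", "64")
            .set("spark.kryoserializer.buffer.max.mb", "512"))
    sc = SparkContext(conf=conf)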
On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar <rokroskar@gmail.com> wrote:
> Thanks for the suggestion; however, this looks even slower. With the
> small data set I'm using, my aggregate function takes ~9 seconds and
> colStats().mean() takes ~1 minute. Also, I can't get it to run with
> the Kryo serializer; I get the error:
>
> com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
> required: 8
>
> Is there an easy/obvious fix?
>
>
> On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>
>> There is some serialization overhead. You can try
>>
>> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
>> . Xiangrui
>>
>> On Wed, Jan 7, 2015 at 9:42 AM, rok <rokroskar@gmail.com> wrote:
>> > I have an RDD of SparseVectors and I'd like to calculate the column
>> > means, returning a dense vector. I've tried doing this with the
>> > following (using PySpark, Spark v1.2.0):
>> >
>> > def aggregate_partition_values(vec1, vec2):
>> >     vec1[vec2.indices] += vec2.values
>> >     return vec1
>> >
>> > def aggregate_combined_vectors(vec1, vec2):
>> >     if all(vec1 == vec2):
>> >         # then the vector came from only one partition
>> >         return vec1
>> >     else:
>> >         return vec1 + vec2
>> >
>> > means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values,
>> >                        aggregate_combined_vectors)
>> > means = means / nvals
>> >
>> > This turns out to be really slow, and the runtime doesn't seem to
>> > depend on how many vectors there are, so there appears to be some
>> > overhead somewhere that I'm not understanding. Is there a better way
>> > of doing this?
>> >
>
>
