spark-user mailing list archives

From Rok Roskar <rokros...@gmail.com>
Subject Re: calculating the mean of SparseVector RDD
Date Mon, 12 Jan 2015 12:42:53 GMT
This was without using Kryo -- if I use Kryo, I get the buffer overflow
error from my earlier message:

com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
required: 8
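
If the problem is just the serializer buffer being too small -- an
assumption on my part, but that is what this message usually indicates --
raising the Kryo buffer limits in the SparkConf may help. A minimal sketch
(the *.mb keys are the Spark 1.x names; later releases call them
spark.kryoserializer.buffer and spark.kryoserializer.buffer.max):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer")
        # initial per-object serialization buffer, in MB
        .set("spark.kryoserializer.buffer.mb", "8")
        # hard ceiling the buffer may grow to, in MB
        .set("spark.kryoserializer.buffer.max.mb", "128"))
sc = SparkContext(conf=conf)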

Just calling colStats doesn't actually compute those statistics, does it?
It looks like the computation is only carried out once you call the .mean()
method.
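
For reference, the pattern I mean is the following (a minimal sketch,
where vals stands for the RDD of SparseVectors):

from pyspark.mllib.stat import Statistics

# build the column summary in a single pass over the RDD
summary = Statistics.colStats(vals)
means = summary.mean()  # dense numpy array of per-column means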



On Sat, Jan 10, 2015 at 7:04 AM, Xiangrui Meng <mengxr@gmail.com> wrote:

> colStats() computes the mean values along with several other summary
> statistics, which makes it slower. How is the performance if you don't
> use Kryo? -Xiangrui
>
> On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar <rokroskar@gmail.com> wrote:
> > Thanks for the suggestion -- however, it looks like this is even
> > slower. With the small data set I'm using, my aggregate function takes
> > ~9 seconds and the colStats.mean() takes ~1 minute. And I can't get it
> > to run with the Kryo serializer -- I get the error:
> >
> > com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
> > required: 8
> >
> > Is there an easy/obvious fix?
> >
> >
> > On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> >>
> >> There is some serialization overhead. You can try
> >>
> >> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
> >>
> >> -Xiangrui
> >>
> >> On Wed, Jan 7, 2015 at 9:42 AM, rok <rokroskar@gmail.com> wrote:
> >> > I have an RDD of SparseVectors and I'd like to calculate the means,
> >> > returning a dense vector. I've tried doing this with the following
> >> > (using pyspark, spark v1.2.0):
> >> >
> >> > import numpy as np
> >> >
> >> > # seqOp: accumulate a SparseVector's nonzero entries into a dense array
> >> > def aggregate_partition_values(vec1, vec2):
> >> >     vec1[vec2.indices] += vec2.values
> >> >     return vec1
> >> >
> >> > # combOp: merge the per-partition dense sums
> >> > def aggregate_combined_vectors(vec1, vec2):
> >> >     if all(vec1 == vec2):
> >> >         # then the vector came from only one partition
> >> >         return vec1
> >> >     else:
> >> >         return vec1 + vec2
> >> >
> >> > means = vals.aggregate(np.zeros(vec_len),
> >> >                        aggregate_partition_values,
> >> >                        aggregate_combined_vectors)
> >> > means = means / nvals
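> >> >
> >> > (Aside: the equality test in the combine step is probably
> >> > unnecessary -- aggregate() only merges per-partition results into a
> >> > copy of the zero vector, so plain elementwise addition should be
> >> > safe. A sketch of the same computation under that assumption:)
> >> >
> >> > # same seqOp as above; combOp reduced to plain addition
> >> > means = vals.aggregate(np.zeros(vec_len),
> >> >                        aggregate_partition_values,
> >> >                        lambda a, b: a + b) / nvals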
> >> >
> >> > This turns out to be really slow -- and the runtime doesn't seem to
> >> > depend on how many vectors there are, so there seems to be some
> >> > overhead somewhere that I'm not understanding. Is there a better way
> >> > of doing this?
> >> >
> >
> >
>
