# spark-user mailing list archives

From Jeremy Freeman <freeman.jer...@gmail.com>
Subject Re: Computing cosine similarity using pyspark
Date Tue, 27 May 2014 12:42:12 GMT
```
Hi Jamal,

One nice feature of PySpark is that you can easily use existing functions
from NumPy and SciPy inside your Spark code. For a simple example, the
following uses Spark's cartesian operation (which combines pairs of vectors
into tuples), followed by NumPy's corrcoef to compute the Pearson
correlation coefficient between every pair of a set of vectors. The vectors
are an RDD of numpy arrays.

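>> from numpy import array, corrcoef

>> data = sc.parallelize([array([1,2,3]),array([2,4,6.1]),array([3,2,1.1])])
>> # each element of the cartesian product is a pair p = (x, y) of arrays;
>> # indexing is used because Python 3 does not allow "lambda (x,y): ..."
>> corrs = data.cartesian(data).map(
       lambda p: corrcoef(p[0], p[1])[0,1]).collect()
>> corrs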
[1.0, 0.99990086740991746, -0.99953863896044948 ...

This just returns a list of the correlation coefficients. You could also
add a key to each array to keep track of which pair is which:

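>> data_with_keys = sc.parallelize(
       [(0,array([1,2,3])),(1,array([2,4,6.1])),(2,array([3,2,1.1]))])
>> # each element is p = ((k1, v1), (k2, v2)); keep the key pair (k1, k2)
>> # next to the coefficient (indexing works on both Python 2 and 3)
>> corrs_with_keys = data_with_keys.cartesian(data_with_keys).map(
       lambda p: ((p[0][0], p[1][0]), corrcoef(p[0][1], p[1][1])[0,1])).collect()
>> corrs_with_keys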
[((0, 0), 1.0), ((0, 1), 0.99990086740991746), ((0, 2),
-0.99953863896044948) ...

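Finally, for cosine similarity you could replace the corrcoef(...)[0,1]
call in either of the above with 1 - scipy.spatial.distance.cosine(...).
Note that scipy's cosine returns the cosine distance (1 minus the
similarity) as a single number, so the [0,1] index is not needed.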

Hope that's useful. As Andrei said, the answer partly depends on exactly
what you're trying to do.

-- Jeremy

On Fri, May 23, 2014 at 2:41 PM, Andrei <faithlessfriend@gmail.com> wrote:

> Do you need cosine distance and correlation between vectors or between
> variables (elements of vector)? It would be helpful if you could tell us
>
>
> On Thu, May 22, 2014 at 5:49 PM, jamal sasha <jamalshasha@gmail.com> wrote:
>
>> Hi,
>>   I have bunch of vectors like
>> [0.1234,-0.231,0.23131]
>> .... and so on.
>>
>> and  I want to compute cosine similarity and pearson correlation using
>> pyspark..
>> How do I do this?
>> Any ideas?
>> Thanks
>>
>
>

```
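The pairwise pattern in the message above can be tried locally, without a Spark context. The following is a minimal sketch, not the poster's code: `cosine_similarity` and `pairwise` are hypothetical helper names, and `itertools.product` is used here only as a local stand-in for `data.cartesian(data).map(...)`. It computes cosine similarity directly with NumPy, which sidesteps the fact that `scipy.spatial.distance.cosine` returns a distance (1 minus the similarity).

```python
from itertools import product

import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between x and y: dot(x, y) / (|x| * |y|)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def pairwise(keyed_vectors, fn):
    """Local stand-in (an assumption of this sketch) for
    rdd.cartesian(rdd).map(...): apply fn to every ordered pair."""
    return [((k1, k2), fn(v1, v2))
            for (k1, v1), (k2, v2) in product(keyed_vectors, repeat=2)]


# Same vectors as in the message, keyed by index.
keyed = [(0, np.array([1, 2, 3])),
         (1, np.array([2, 4, 6.1])),
         (2, np.array([3, 2, 1.1]))]

cos_sims = dict(pairwise(keyed, cosine_similarity))
pearson = dict(pairwise(keyed, lambda a, b: float(np.corrcoef(a, b)[0, 1])))
```

The two measures generally differ for the same pair: Pearson correlation is the cosine similarity of the mean-centered vectors, so vectors 0 and 2 are strongly anti-correlated while their cosine similarity is still positive. In Spark, the same `cosine_similarity` function could presumably be passed to `map` over the cartesian product in place of the `corrcoef` call.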