spark-user mailing list archives

From Soheil Pourbafrani <soheil.i...@gmail.com>
Subject Using columnSimilarity with threshold result in greater than one
Date Thu, 15 Nov 2018 17:56:14 GMT
To test the *columnSimilarities* method in Spark, I create a *RowMatrix*
object:

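import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val temp = sc.parallelize(Array((5.0, 1.0, 4.0), (2.0, 3.0, 8.0),
  (4.0, 5.0, 10.0), (1.0, 3.0, 6.0)))

// Turn each tuple into one dense row vector of the matrix
val rows = temp.map(line => Vectors.dense(line._1, line._2, line._3))

val mat = new RowMatrix(rows)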


The matrix is:
5  1   4
2  3   8
4  5   10
1  3   6

Calling columnSimilarities() without a threshold returns the cosine
similarities between the columns of the matrix:
(5, 2, 4, 1)
(1, 3, 5, 3)
(4, 8, 10, 6)
that is:

MatrixEntry(0,2,0.8226366627527562)
MatrixEntry(0,1,0.755742181606458)
MatrixEntry(1,2,0.9847319278346619)
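
As a sanity check on the values above, here is a minimal sketch in plain Scala (no Spark) that computes the cosine similarity of each column pair directly from the definition. The object name CosineCheck and its helper functions are only illustrative and are not part of any Spark API.

```scala
object CosineCheck {
  // Columns of the 4x3 matrix from this message, copied by hand.
  val columns: Vector[Vector[Double]] = Vector(
    Vector(5.0, 2.0, 4.0, 1.0),
    Vector(1.0, 3.0, 5.0, 3.0),
    Vector(4.0, 8.0, 10.0, 6.0)
  )

  // Dot product of two equally sized vectors.
  def dot(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Cosine similarity: dot(a, b) / (|a| * |b|).
  def cosine(a: Seq[Double], b: Seq[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // All column pairs (i, j) with i < j, mirroring the upper-triangular
  // entries listed above.
  def allPairs: Seq[(Int, Int, Double)] =
    for {
      i <- columns.indices
      j <- columns.indices if j > i
    } yield (i, j, cosine(columns(i), columns(j)))

  def main(args: Array[String]): Unit =
    allPairs.foreach { case (i, j, s) => println(s"($i, $j) -> $s") }
}
```

The printed values should agree with the three MatrixEntry results above up to floating-point rounding, and none of them exceeds one.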

The problem appears when I set a threshold:

val est = mat.columnSimilarities(0.5)

The similarity of some pairs then comes out greater than one, although a
cosine similarity of these non-negative columns should lie between zero
and one:

MatrixEntry(0,2,2.821741602543195)
MatrixEntry(0,1,1.319846878608914)

My primary question is: how should results greater than one be interpreted?
Also, does Spark use the *DIMSUM* algorithm only for cosine similarities
computed with a threshold, or does it use DIMSUM when no threshold is
given, too?
