spark-dev mailing list archives

From Chunnan Yao
Subject Indices of SparseVector must be ordered while computing SVD
Date Wed, 22 Apr 2015 15:29:31 GMT
Hi all, 
I am using Spark 1.3.1 to write a Spectral Clustering algorithm, and
something really confused me today. At first I thought my implementation was
wrong, but it turns out to be an issue in MLlib. Fortunately, I've figured
it out.

I suggest adding a note to the MLlib user documentation (as far as I know,
there is no such note yet) that the indices of a local sparse vector must be
in ascending order. Because I was unaware of this, I spent a lot of time
trying to work out why computeSVD of RowMatrix did not run correctly on
sparse data. I don't know how a SparseVector with unordered indices affects
other functions, but I believe it is necessary either to let users know or
to fix it. Actually, it's very easy to fix: just add a sortBy call in the
internal construction of SparseVector.
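
The sortBy idea can be sketched as follows. This is only a minimal
illustration, not MLlib code: SparseOrdering and sortByIndex are hypothetical
names, and the helper works on plain arrays so that it runs without Spark.

```scala
// Hypothetical helper (not part of MLlib): reorder the (index, value) pairs
// of a sparse vector so that the indices are ascending.
object SparseOrdering {
  def sortByIndex(indices: Array[Int],
                  values: Array[Double]): (Array[Int], Array[Double]) = {
    require(indices.length == values.length,
      "indices and values must have the same length")
    // Pair each index with its value, then sort the pairs by index.
    val sorted = indices.zip(values).sortBy(_._1)
    (sorted.map(_._1), sorted.map(_._2))
  }
}

// Possible use in spark-shell (assumes Vectors is imported):
//   val (idx, vals) = SparseOrdering.sortByIndex(Array(2, 1, 0), Array(5.0, 4.0, 3.0))
//   Vectors.sparse(3, idx, vals)
```

Sorting the pairs together keeps each value attached to its index, so the
vector itself is unchanged and only its storage order differs.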

Here is an example that reproduces the effect of an unordered SparseVector on
computeSVD:
// in spark-shell, Spark 1.3.1
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}

// Every row lists its indices in ascending order.
val sparseData_ordered = Seq(
  Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
  Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
  Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
  Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
)
val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))

// Same entries, but the second and fourth rows list their indices in descending order.
val sparseData_not_ordered = Seq(
  Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
  Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
  Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
  Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
)
val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))

// Clearly, sparseMat_ordered and sparseMat_not_ordered are essentially the
// same matrix.
// However, the computeSVD results for these two matrices are different. Users
// should be notified about this situation.
The results are: 

not ordered: 

Looking into this issue, I found that the cause is in RowMatrix.scala (line
629). The sparse dspr implementation there requires ordered indices, because
it scans the indices consecutively to skip empty columns.
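
To illustrate why such a scan depends on ordering, here is a simplified
sketch of a sparse rank-1 update into a packed upper-triangular array. It is
an assumption-laden illustration and not the code from RowMatrix.scala:
PackedGram, sparseRank1Update and demo are hypothetical names, and the packed
layout (entry (i, j) with i <= j stored at j * (j + 1) / 2 + i) is assumed
for the example.

```scala
// Illustrative sketch only, not the MLlib implementation.
object PackedGram {
  // Adds x * x^T to U, where U holds the upper triangle of an n x n
  // symmetric matrix in column-major packed form.
  // The scan assumes that `indices` is in ascending order.
  def sparseRank1Update(indices: Array[Int],
                        values: Array[Double],
                        U: Array[Double]): Unit = {
    var colStart = 0
    var prevCol = 0
    var j = 0
    while (j < indices.length) {
      val col = indices(j)
      // Move the offset from packed column prevCol to packed column col.
      colStart += (col - prevCol) * (col + prevCol + 1) / 2
      var i = 0
      while (i <= j) {
        // Lands on entry (indices(i), col) only if indices(i) <= col.
        U(colStart + indices(i)) += values(j) * values(i)
        i += 1
      }
      prevCol = col
      j += 1
    }
  }

  // Applies the update to the same row written with ascending and with
  // descending indices, and returns both packed results.
  def demo(): (Seq[Double], Seq[Double]) = {
    val ordered = new Array[Double](6)
    sparseRank1Update(Array(0, 1, 2), Array(3.0, 4.0, 5.0), ordered)
    val unordered = new Array[Double](6)
    sparseRank1Update(Array(2, 1, 0), Array(5.0, 4.0, 3.0), unordered)
    (ordered.toSeq, unordered.toSeq)
  }
}
```

In this sketch the two packed arrays returned by demo() differ, even though
both inputs describe the same row (3.0, 4.0, 5.0). With descending indices
the inner loop writes some products into the wrong packed positions.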

Feel the sparking Spark!