spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <>
Subject [jira] [Created] (SPARK-1390) Refactor RDD backed matrices
Date Wed, 02 Apr 2014 05:31:19 GMT
Xiangrui Meng created SPARK-1390:

             Summary: Refactor RDD backed matrices
                 Key: SPARK-1390
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng
            Priority: Blocker

The current interfaces of RDD-backed matrices need refactoring for the v1.0 release. It would
be better to have a clear separation between local matrices and those backed by an RDD. Right now,
we have:

1. org.apache.spark.mllib.linalg.SparseMatrix, which is a wrapper over an RDD of matrix entries,
i.e., coordinate list format.
2. org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix, which is a wrapper over RDD[Array[Double]],
i.e., row-oriented format.

We will see a naming collision when we introduce a local SparseMatrix, and the name TallSkinnyDenseMatrix
will no longer be accurate if we switch to RDD[Vector] instead of RDD[Array[Double]]. It would be better
to have "RDD" in the type name to signal that operations will trigger a job.

The proposed names (all under org.apache.spark.mllib.linalg.rdd):

1. RDDMatrix: trait for matrices backed by one or more RDDs
2. CoordinateRDDMatrix: wrapper of RDD[RDDMatrixEntry]
3. RowRDDMatrix: wrapper of RDD[Vector] whose rows do not have special ordering
4. IndexedRowRDDMatrix: wrapper of RDD[(Long, Vector)] whose rows are associated with indices
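
To make the proposed layout concrete, here is a minimal sketch of how the four
types could relate. It is an illustration only, not the actual MLlib code. The
object name RddMatrixSketch, the numRows/numCols members, the field names of
RDDMatrixEntry, and the Vec alias are all assumptions. RDD is replaced by a
plain Seq stand-in so the sketch compiles without Spark.

```scala
// Hypothetical sketch of the proposed type layout; not actual MLlib code.
object RddMatrixSketch {
  // Stand-ins so the sketch compiles without Spark on the classpath.
  // In the proposal these would be org.apache.spark.rdd.RDD and
  // org.apache.spark.mllib.linalg.Vector.
  type RDD[T] = Seq[T]
  type Vec = Array[Double]

  // One (row, column, value) entry in coordinate list format.
  // Field names are assumed; the issue only names the type.
  final case class RDDMatrixEntry(i: Long, j: Long, value: Double)

  // 1. Common trait for matrices backed by one or more RDDs.
  //    Both members are assumptions; the issue does not list any.
  trait RDDMatrix {
    def numRows: Long
    def numCols: Long
  }

  // 2. Wrapper of RDD[RDDMatrixEntry].
  final class CoordinateRDDMatrix(val entries: RDD[RDDMatrixEntry]) extends RDDMatrix {
    def numRows: Long = if (entries.isEmpty) 0L else entries.map(_.i).max + 1
    def numCols: Long = if (entries.isEmpty) 0L else entries.map(_.j).max + 1
  }

  // 3. Wrapper of RDD[Vector]; rows carry no special ordering.
  final class RowRDDMatrix(val rows: RDD[Vec]) extends RDDMatrix {
    def numRows: Long = rows.size.toLong
    def numCols: Long = rows.headOption.map(_.length.toLong).getOrElse(0L)
  }

  // 4. Wrapper of RDD[(Long, Vector)]; each row has an explicit index.
  final class IndexedRowRDDMatrix(val rows: RDD[(Long, Vec)]) extends RDDMatrix {
    def numRows: Long = if (rows.isEmpty) 0L else rows.map(_._1).max + 1
    def numCols: Long = rows.headOption.map(_._2.length.toLong).getOrElse(0L)
  }
}
```

With real RDDs, size computations like these would each launch a job, which is
the motivation given above for keeping "RDD" in the type names.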

The proposal is subject to change, but it would be nice to make the changes before v1.0.

