spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-1390) Refactor RDD backed matrices
Date Wed, 02 Apr 2014 05:31:19 GMT
Xiangrui Meng created SPARK-1390:
------------------------------------

             Summary: Refactor RDD backed matrices
                 Key: SPARK-1390
                 URL: https://issues.apache.org/jira/browse/SPARK-1390
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng
            Priority: Blocker


The current interfaces of RDD backed matrices need refactoring for the v1.0 release. It would
be better to have a clear separation between local matrices and those backed by RDDs. Right now,
we have

1. org.apache.spark.mllib.linalg.SparseMatrix, which is a wrapper over an RDD of matrix entries,
i.e., coordinate list format.
2. org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix, which is a wrapper over RDD[Array[Double]],
i.e., row-oriented format.
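
To make the two layouts concrete, here is a minimal sketch of the same small matrix in coordinate list form and in row-oriented form. It is illustrative only: plain Scala collections stand in for RDDs, and the names MatrixLayouts, Entry, and toRows are hypothetical, not part of MLlib.

```scala
// Illustrative only: plain Scala collections stand in for RDDs.
object MatrixLayouts {
  // Hypothetical entry type for the coordinate list layout:
  // (row index, column index, value).
  case class Entry(i: Long, j: Long, value: Double)

  // The 2 x 3 matrix
  //   1.0 0.0 2.0
  //   0.0 3.0 0.0
  // stored as its nonzero entries only.
  val coordinateList: Seq[Entry] = Seq(
    Entry(0, 0, 1.0), Entry(0, 2, 2.0), Entry(1, 1, 3.0))

  // The same matrix in row-oriented layout: one dense array per row.
  val rows: Seq[Array[Double]] = Seq(
    Array(1.0, 0.0, 2.0),
    Array(0.0, 3.0, 0.0))

  // Convert coordinate entries to dense rows, given the dimensions.
  def toRows(entries: Seq[Entry], numRows: Int, numCols: Int): Seq[Array[Double]] = {
    val out = Array.fill(numRows)(Array.fill(numCols)(0.0))
    entries.foreach(e => out(e.i.toInt)(e.j.toInt) = e.value)
    out.toSeq
  }
}
```

The coordinate list carries explicit indices with every value, while the row-oriented layout relies on position within each array.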

We will see a naming collision when we introduce a local SparseMatrix, and the name TallSkinnyDenseMatrix
will no longer be accurate if we switch to RDD[Vector] instead of RDD[Array[Double]]. It would be better
to have "RDD" in the type name to suggest that operations will trigger a job.

The proposed names (all under org.apache.spark.mllib.linalg.rdd):

1. RDDMatrix: trait for matrices backed by one or more RDDs
2. CoordinateRDDMatrix: wrapper of RDD[RDDMatrixEntry]
3. RowRDDMatrix: wrapper of RDD[Vector] whose rows do not have special ordering
4. IndexedRowRDDMatrix: wrapper of RDD[(Long, Vector)] whose rows are associated with indices
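
The proposed names could fit together roughly as in the sketch below. The type names come from the list above; everything else is an assumption made for illustration: the numRows and numCols members, the toRowRDDMatrix helper, the fields of RDDMatrixEntry, and the enclosing RddMatrixSketch object. To keep the sketch runnable without Spark, RDD and Vector are local stand-in aliases rather than org.apache.spark.rdd.RDD and org.apache.spark.mllib.linalg.Vector.

```scala
object RddMatrixSketch {
  // Stand-ins so the sketch runs without Spark on the classpath.
  // In Spark these would be org.apache.spark.rdd.RDD and
  // org.apache.spark.mllib.linalg.Vector.
  type RDD[T] = Seq[T]
  type Vector = Array[Double]

  // Assumed shape of a matrix entry: (row index, column index, value).
  case class RDDMatrixEntry(i: Long, j: Long, value: Double)

  // 1. Trait for matrices backed by one or more RDDs.
  //    The members are hypothetical; on a real RDD each would trigger a job.
  trait RDDMatrix {
    def numRows: Long
    def numCols: Long
  }

  // 2. Wrapper of RDD[RDDMatrixEntry] (coordinate list format).
  class CoordinateRDDMatrix(val entries: RDD[RDDMatrixEntry]) extends RDDMatrix {
    def numRows: Long = if (entries.isEmpty) 0L else entries.map(_.i).max + 1
    def numCols: Long = if (entries.isEmpty) 0L else entries.map(_.j).max + 1
  }

  // 3. Wrapper of RDD[Vector] whose rows have no special ordering.
  class RowRDDMatrix(val rows: RDD[Vector]) extends RDDMatrix {
    def numRows: Long = rows.size.toLong
    def numCols: Long = rows.headOption.map(_.length.toLong).getOrElse(0L)
  }

  // 4. Wrapper of RDD[(Long, Vector)] whose rows carry explicit indices.
  class IndexedRowRDDMatrix(val rows: RDD[(Long, Vector)]) extends RDDMatrix {
    def numRows: Long = if (rows.isEmpty) 0L else rows.map(_._1).max + 1
    def numCols: Long = rows.headOption.map(_._2.length.toLong).getOrElse(0L)

    // Hypothetical helper: drop the indices to get an unordered row matrix.
    def toRowRDDMatrix: RowRDDMatrix = new RowRDDMatrix(rows.map(_._2))
  }
}
```

Because every wrapper carries "RDD" in its name, a local SparseMatrix could later be added without a collision.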

The proposal is subject to change, but it would be nice to make the changes before v1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
