spark-issues mailing list archives

From "Sneha Shukla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
Date Fri, 01 Jun 2018 11:30:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497869#comment-16497869 ]

Sneha Shukla commented on SPARK-8614:
-------------------------------------

Has this been resolved in any of the later versions of Spark? We believe we're encountering
the same issue.

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---------------------------------------------------------------
>
>                 Key: SPARK-8614
>                 URL: https://issues.apache.org/jira/browse/SPARK-8614
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: Jan Luts
>            Priority: Major
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply, the indices are dropped before calling the corresponding RowMatrix methods. For example, for IndexedRowMatrix.computeSVD:
>    val svd = toRowMatrix().computeSVD(k, computeU, rCond)
> and for IndexedRowMatrix.multiply:
>    val mat = toRowMatrix().multiply(B)
> The results are then zipped with the original indices, e.g. for IndexedRowMatrix.computeSVD:
>    val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>       IndexedRow(i, v)
>    }
> and for IndexedRowMatrix.multiply:
>    
>    val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>       IndexedRow(i, v)
>    }
> I have observed that for IndexedRowMatrix.computeSVD().U and IndexedRowMatrix.multiply() (both of which depend on RowMatrix.multiply), row indices can get mixed up when running Spark jobs with multiple executors/machines: the vectors and indices of the result no longer seem to correspond.
> To me it looks like this is caused by zipping RDDs that have different orderings.
> For IndexedRowMatrix.multiply, I have observed that ordering within partitions is preserved, but that it seems to get mixed up between partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions call in RowMatrix.multiply:
> val AB = rows.mapPartitions { iter =>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no longer there.
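
To make the reported pattern concrete, here is a minimal plain-Scala sketch of the difference between zipping indices back by position and carrying each index through the computation. It does not use Spark and is not the MLlib implementation: partitions are modelled as nested Vectors, and the names RowOrderSketch, times, zipByPosition, carryIndex and the reorder parameter are hypothetical, introduced only for illustration. The reorder function stands in for the assumption that the index side and the result side are enumerated in different partition orders.

```scala
// Plain-Scala model of the problem described above; no Spark dependency.
// A "partition" is a Vector of (index, row) pairs; the dataset is a Vector of partitions.
object RowOrderSketch {
  type Row = Vector[Double]
  type Part = Vector[(Long, Row)]

  // Multiply a 1 x n row by an n x k matrix given as a Vector of n rows.
  def times(row: Row, b: Vector[Row]): Row =
    b.transpose.map(col => row.zip(col).map { case (x, y) => x * y }.sum)

  // Fragile pattern: drop the indices, compute, then zip back by position.
  // `reorder` simulates the index side seeing partitions in a different order.
  def zipByPosition(
      parts: Vector[Part],
      b: Vector[Row],
      reorder: Vector[Part] => Vector[Part]): Vector[(Long, Row)] = {
    val indices = reorder(parts).flatten.map(_._1)
    val products = parts.flatten.map { case (_, v) => times(v, b) }
    indices.zip(products)
  }

  // Robust pattern: keep each index attached to its row in the same map step,
  // so no positional alignment between two collections is needed.
  def carryIndex(parts: Vector[Part], b: Vector[Row]): Vector[(Long, Row)] =
    parts.flatten.map { case (i, v) => (i, times(v, b)) }
}
```

With two partitions and reorder set to _.reverse, zipByPosition pairs the indices of the second partition with the vectors of the first, which is the same shape as the mix-up listed in the description, while carryIndex keeps every pair aligned. In Spark terms, the analogous idea would be to build each IndexedRow from the row's own index inside a single map over the indexed rows, rather than zipping two separately derived RDDs.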



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

