spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
Date Fri, 22 Feb 2019 04:16:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774750#comment-16774750 ]

Sean Owen commented on SPARK-26881:
-----------------------------------

I wouldn't make it configurable. I don't think it makes sense to set it such that the collect
size is either smaller or larger than the max driver result size. It's already kind of configurable
through the max result size, although that setting's meaning is only indirectly related.

> Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-26881
>                 URL: https://issues.apache.org/jira/browse/SPARK-26881
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.2.0
>            Reporter: Rafael RENAUDIN-AVINO
>            Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k columns).
> Computing the Gramian of a large RowMatrix is enough to reproduce the issue.
>
> The problem arises in the treeAggregate phase of the Gramian matrix computation: the results
> sent to the driver are enormous.
> A potential solution would be to replace the hard-coded depth (2) of the tree
> aggregation with a heuristic based on the number of partitions, the driver max result size,
> and the memory size of the dense vectors being aggregated (see the inequality below):
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to hear the
> community's opinion on such a fix to know whether it's worth investing my time in a clean pull
> request.
>
> Note that I only faced this issue with Spark 2.2, but I suspect it affects later versions
> as well.
>  
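
The inequality quoted above can be turned into a small depth-selection routine. The following is only a sketch of that idea, not the reporter's actual fix: the function names, the `max_depth` cap, and the assumption that the aggregated buffer is a packed upper triangle of 8-byte doubles are all illustrative choices rather than anything confirmed in this thread.

```python
def gramian_vector_bytes(num_cols):
    """Estimated size of one aggregated buffer, in bytes.

    Assumption: the buffer holds the packed upper triangle of the
    num_cols x num_cols Gramian, stored as 8-byte doubles.
    """
    return (num_cols * (num_cols + 1) // 2) * 8


def min_tree_depth(num_partitions, vector_bytes, max_result_bytes, max_depth=10):
    """Smallest depth d such that
    num_partitions ** (1 / d) * vector_bytes <= max_result_bytes.

    max_depth is a hypothetical safety cap, returned if no smaller
    depth satisfies the inequality.
    """
    if vector_bytes > max_result_bytes:
        # No depth can help: even one partial result is over the limit.
        raise ValueError("a single aggregated vector exceeds the driver result limit")
    for depth in range(1, max_depth + 1):
        # Approximate number of partial results reaching the driver at this depth.
        fan_in = num_partitions ** (1.0 / depth)
        if fan_in * vector_bytes <= max_result_bytes:
            return depth
    return max_depth


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 partitions, 100 MiB buffers, 2 GiB limit.
    print(min_tree_depth(10_000, 100 * 1024**2, 2 * 1024**3))
    # Buffer size estimate for ~30k columns, as in the report.
    print(gramian_vector_bytes(30_000))
```

Under these assumptions a ~30k-column buffer is already about 3.6 GB on its own, so a result-size limit below that could not be met at any depth; the routine raises in that case rather than returning a depth.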


