spark-issues mailing list archives

From "DB Tsai (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-1401) Use mapPartitions instead of map to avoid creating expensive objects in GradientDescent optimizer
Date Thu, 03 Apr 2014 01:05:14 GMT
DB Tsai created SPARK-1401:
------------------------------

             Summary: Use mapPartitions instead of map to avoid creating expensive objects in
GradientDescent optimizer
                 Key: SPARK-1401
                 URL: https://issues.apache.org/jira/browse/SPARK-1401
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: DB Tsai
            Priority: Minor


In GradientDescent, each row of the input data currently creates its own gradient matrix
object, and these per-row objects are then summed up in the reducer.

We found that when the number of features is on the order of thousands, this becomes the bottleneck.
The situation was even worse when we tested with a Newton optimizer, because the dimension of
the Hessian matrix is so much larger.

In our testing, when the number of features is in the hundreds of thousands, GC kicks in for every
row of input, and it sometimes brings down the workers.

By aggregating the lossSum and gradientSum with mapPartitions instead, we solved the GC issue
and scale better with the number of features.
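For illustration, a minimal sketch of the two styles (this is not the actual Spark patch; the
gradient here is a hypothetical least-squares gradient, and a real implementation would wrap the
per-partition loop in RDD.mapPartitions). The point is that the per-row version allocates a fresh
gradient array for every example, while the per-partition version reuses one gradientSum
accumulator and updates it in place:

```scala
// Sketch only, assuming a least-squares gradient (x.w - y) * x.
object GradientAggregation {

  // Per-row style: allocates a brand-new gradient array per example,
  // which is what triggers heavy GC at high feature counts.
  def rowGradient(x: Array[Double], y: Double, w: Array[Double]): Array[Double] = {
    val err = x.zip(w).map { case (a, b) => a * b }.sum - y
    x.map(_ * err)
  }

  // Before: map creates one gradient object per row; reduce sums them.
  def perRowSum(data: Seq[(Array[Double], Double)], w: Array[Double]): Array[Double] =
    data.map { case (x, y) => rowGradient(x, y, w) }
        .reduce((a, b) => a.zip(b).map { case (p, q) => p + q })

  // After (mapPartitions style): one gradientSum accumulator per partition,
  // updated in place; no per-row gradient object is ever allocated.
  def perPartitionSum(iter: Iterator[(Array[Double], Double)],
                      w: Array[Double]): Array[Double] = {
    val gradientSum = new Array[Double](w.length)
    iter.foreach { case (x, y) =>
      var margin = 0.0
      var i = 0
      while (i < x.length) { margin += x(i) * w(i); i += 1 }
      val err = margin - y
      i = 0
      while (i < gradientSum.length) { gradientSum(i) += x(i) * err; i += 1 }
    }
    gradientSum
  }
}
```

On an RDD this would be invoked roughly as
data.mapPartitions(it => Iterator(perPartitionSum(it, w))).reduce(sumArrays), so the driver
only ever sees one partial sum per partition rather than one gradient object per row.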



--
This message was sent by Atlassian JIRA
(v6.2#6252)
