spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
Date Tue, 14 Feb 2017 21:42:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866755#comment-15866755 ]

Nick Pentreath edited comment on SPARK-19208 at 2/14/17 9:42 PM:
-----------------------------------------------------------------

Ah right, I see. Yes, rewrite rules would be a good ultimate goal.

One question I have. If we do:

{code}
val summary = df.select(VectorSummarizer.metrics("min", "max").summary("features"))
{code}

How will we return a DF with cols {{min}} and {{max}}, given that a UDAF does not seem to support multiple return cols?

Or do we have to live with the struct return type for now, until we can do the rewrite version?


> MultivariateOnlineSummarizer performance optimization
> -----------------------------------------------------
>
>                 Key: SPARK-19208
>                 URL: https://issues.apache.org/jira/browse/SPARK-19208
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
>         Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} use {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] 748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
>   private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.
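
To illustrate the point made in the description above, here is a minimal sketch in plain Scala (no Spark dependency) of a summarizer that keeps only the single array {{MaxAbsScaler}} needs. The names {{MaxAbsOnlySummarizer}} and {{Footprint}} are hypothetical and are not taken from the PR. The footprint estimate assumes each per-feature array is dense, with one 8-byte element per feature, and ignores JVM object overhead.

```scala
// Hypothetical sketch, not Spark's implementation: tracks only max(|x|) per feature.
final class MaxAbsOnlySummarizer(numFeatures: Int) extends Serializable {
  // The one per-feature array this statistic needs (zero-initialized).
  private val maxAbs: Array[Double] = new Array[Double](numFeatures)

  /** Folds in one sparse instance given as parallel (indices, values) arrays. */
  def add(indices: Array[Int], values: Array[Double]): this.type = {
    require(indices.length == values.length, "indices and values must align")
    var i = 0
    while (i < indices.length) {
      val a = math.abs(values(i))
      if (a > maxAbs(indices(i))) maxAbs(indices(i)) = a
      i += 1
    }
    this
  }

  /** Combines two partial summaries, e.g. computed on different partitions. */
  def merge(other: MaxAbsOnlySummarizer): this.type = {
    require(other.maxAbs.length == maxAbs.length, "dimension mismatch")
    var i = 0
    while (i < maxAbs.length) {
      if (other.maxAbs(i) > maxAbs(i)) maxAbs(i) = other.maxAbs(i)
      i += 1
    }
    this
  }

  /** Returns a copy of the per-feature maximum absolute values. */
  def result: Array[Double] = maxAbs.clone()
}

// Rough footprint estimate under the dense, 8-bytes-per-element assumption.
object Footprint {
  val numFeatures: Long = 29890095L // feature count quoted in the issue
  def denseBytes(numArrays: Int): Long = numArrays * numFeatures * 8L
}
```

Under those assumptions, eight dense arrays over 29,890,095 features come to roughly 1.9 GB per summarizer instance, against about 239 MB for a single array, which is consistent with the OOM reported at --executor-memory 1G.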



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


