spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <>
Subject [jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
Date Thu, 23 Feb 2017 15:19:44 GMT


Nick Pentreath commented on SPARK-19634:

Thanks [~timhunter].

In terms of performance, we expect to gain from (a) not computing unnecessary metrics or values
(mainly saving memory for the intermediate arrays created, and potentially some computation);
and (b) using UDAF.
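
As a rough illustration of point (a) only, and not Spark code: a plain-Python sketch of an online summarizer that allocates intermediate arrays just for the metrics the caller asks for. The class name `SelectiveSummarizer`, its methods, and the metric names are all hypothetical, and the sketch covers only a few statistics.

```python
from typing import Dict, Iterable, List, Sequence


class SelectiveSummarizer:
    """Hypothetical online summarizer; keeps buffers only for requested metrics."""

    SUPPORTED = {"count", "mean", "min", "max"}

    def __init__(self, num_features: int, metrics: Iterable[str]):
        self.metrics = set(metrics)
        unknown = self.metrics - self.SUPPORTED
        if unknown:
            raise ValueError(f"unsupported metrics: {sorted(unknown)}")
        self.n = 0
        # One intermediate array per requested metric, nothing else.
        self.buffers: Dict[str, List[float]] = {}
        if "mean" in self.metrics:
            self.buffers["sum"] = [0.0] * num_features
        if "min" in self.metrics:
            self.buffers["min"] = [float("inf")] * num_features
        if "max" in self.metrics:
            self.buffers["max"] = [float("-inf")] * num_features

    @staticmethod
    def _combine(name: str, a: float, b: float) -> float:
        # Combination rule for each buffer type.
        if name == "sum":
            return a + b
        if name == "min":
            return min(a, b)
        return max(a, b)

    def add(self, row: Sequence[float]) -> "SelectiveSummarizer":
        """Fold one row into the running buffers."""
        self.n += 1
        for name, buf in self.buffers.items():
            for i, x in enumerate(row):
                buf[i] = self._combine(name, buf[i], x)
        return self

    def merge(self, other: "SelectiveSummarizer") -> "SelectiveSummarizer":
        """Combine two partial aggregates, as a distributed aggregation would."""
        if other.metrics != self.metrics:
            raise ValueError("cannot merge summarizers with different metrics")
        self.n += other.n
        for name, buf in self.buffers.items():
            theirs = other.buffers[name]
            for i in range(len(buf)):
                buf[i] = self._combine(name, buf[i], theirs[i])
        return self

    def result(self) -> Dict[str, object]:
        """Return only the metrics that were requested."""
        out: Dict[str, object] = {}
        if "count" in self.metrics:
            out["count"] = self.n
        if "mean" in self.metrics:
            if self.n == 0:
                raise ValueError("mean is undefined for zero rows")
            out["mean"] = [s / self.n for s in self.buffers["sum"]]
        for name in ("min", "max"):
            if name in self.metrics:
                out[name] = list(self.buffers[name])
        return out
```

Asking for `["mean", "max"]` allocates two arrays per partial aggregate instead of one per supported statistic. Whether that saving is noticeable in practice depends on the number of features and is not something this sketch measures.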

Do we expect a large gain from using UDAF? I'm not totally up to date on the current state
of UDAF integration with Tungsten data, but my last impression was that (a) UDAFs
didn't really benefit from it unless they're internal (like HyperLogLog) and (b) array storage
& SerDe in Tungsten was still a bit patchy. Has this changed?

Of course, in terms of API it is beneficial, and we should do it anyway under the assumption
that performance is at least the same as the current implementation. I just want to understand
the expected performance gains, since the implicit assumption is always "DataFrame operations
will be so much faster". In practice this is not always the case for more complex data
types & situations, or for things that switch into RDDs under the hood anyway, such as
the linear models.

> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>                 Key: SPARK-19634
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Timothy Hunter
> This ticket tracks porting the functionality of spark.mllib.MultivariateOnlineSummarizer over to
> A design has been discussed in SPARK-19208. Here is a design doc:

This message was sent by Atlassian JIRA
