flink-issues mailing list archives

From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3613) Add standard deviation, mean, variance to list of Aggregations
Date Wed, 09 Nov 2016 21:37:58 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652113#comment-15652113 ]

Fabian Hueske commented on FLINK-3613:
--------------------------------------

Hi [~anmu],
this issue proposes adding more built-in aggregation functions to the DataSet API.
Since parts of the Table API are built on the DataSet API, such a feature could in principle
also be used to implement, for instance, stddev for batch tables.

However, this would only help for batch tables, so we would still need a separate implementation
for streaming tables. There are also quite a few challenges in implementing these aggregation
functions for the DataSet API. I think Stephan had a good point when he asked whether these
advanced functions would be better suited to the Table API, which is what FLINK-4604 is about.

So I would prefer to close this issue in favor of FLINK-4604.

> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>         Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, variance for org.apache.flink.api.java.aggregation.Aggregations
> Ideally implementation should be single pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et al.,
> International Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces the numerical
> errors that occur when adding a sequence of finite precision floating point numbers. Numerical
> errors arise due to truncation and rounding. These errors can lead to numerical instability
> when calculating variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
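
For illustration only, here is a minimal sketch of what "single pass and numerically stable" could look like. It is not the design from the attachment and not an existing Flink class: RunningStats and KahanSum are hypothetical names, the per-value update follows Welford's method, and the merge step follows the pairwise formula commonly attributed to Chan et al., so that partial results from parallel partitions could be combined. Sample variance (dividing by n - 1) is an assumption; the issue does not say which variant is wanted.

```java
/** Hypothetical accumulator: single-pass mean/variance that can be merged across partitions. */
final class RunningStats {
    private long count;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Folds one value into the running statistics (Welford update). */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean); // uses the already updated mean
    }

    /** Combines a partial result computed elsewhere into this one. */
    void merge(RunningStats other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    long count() { return count; }

    double mean() { return count == 0 ? Double.NaN : mean; }

    /** Sample variance (n - 1 in the denominator); NaN for fewer than two values. */
    double variance() { return count < 2 ? Double.NaN : m2 / (count - 1); }

    double stddev() { return Math.sqrt(variance()); }
}

/** Hypothetical helper: Kahan (compensated) summation of doubles. */
final class KahanSum {
    private double sum;
    private double compensation; // running estimate of the lost low-order bits

    void add(double x) {
        double y = x - compensation;
        double t = sum + y;
        compensation = (t - sum) - y;
        sum = t;
    }

    double value() { return sum; }
}
```

In a DataSet-style aggregation, each parallel task would call add() per record and the partial accumulators would then be combined with merge(). How this would plug into the Aggregations class is left open here.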



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
