flink-issues mailing list archives

From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3613) Add standard deviation, mean, variance to list of Aggregations
Date Mon, 04 Apr 2016 14:42:25 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224259#comment-15224259 ]

Stephan Ewen commented on FLINK-3613:
-------------------------------------

The design of the extended aggregators makes a lot of sense. I agree with Fabian that we should
discuss three things first, however:

  1. Do we want such extended aggregations in the DataSet API, or should we push people to
use the Table API instead? My gut feeling is that it makes sense to have this in the DataSet
API if we answer (2) with "yes" and have a good design for (3).
  2. I assume it should allow using multiple aggregation functions, such that one could create
something like {{(a, b) --> (max(a), min(a), avg(b))}}
  3. How do we want the signatures for this to look? Ideally, this would be made typesafe via a builder
(similar to the CSV input on ExecutionEnvironment).
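
To make points (2) and (3) concrete, here is a minimal sketch in plain Java of what a builder for several aggregations in one pass could look like. Every name in it ({{AggregateSketch}}, {{Row}}, {{Builder}}, {{max}}, {{min}}, {{avg}}, {{over}}) is hypothetical and is not part of the DataSet API or of the attached design. Fields are selected with typed extractor functions instead of positions, which is one possible reading of "typesafe". The result is a plain {{double[]}} for brevity, whereas a real design would probably return a typed tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch only: none of these names exist in Flink.
final class AggregateSketch {

    /** Example input row with two numeric fields, a and b. */
    static final class Row {
        final double a;
        final double b;
        Row(double a, double b) { this.a = a; this.b = b; }
    }

    /** Running state of one aggregation over rows of type T. */
    interface Aggregator<T> {
        void add(T row);
        double result();
    }

    /** Collects aggregations and evaluates all of them in a single pass. */
    static final class Builder<T> {
        private final List<Aggregator<T>> aggregators = new ArrayList<>();

        Builder<T> max(ToDoubleFunction<T> field) {
            aggregators.add(new Aggregator<T>() {
                private double best = Double.NEGATIVE_INFINITY;
                @Override public void add(T row) { best = Math.max(best, field.applyAsDouble(row)); }
                @Override public double result() { return best; }
            });
            return this;
        }

        Builder<T> min(ToDoubleFunction<T> field) {
            aggregators.add(new Aggregator<T>() {
                private double best = Double.POSITIVE_INFINITY;
                @Override public void add(T row) { best = Math.min(best, field.applyAsDouble(row)); }
                @Override public double result() { return best; }
            });
            return this;
        }

        Builder<T> avg(ToDoubleFunction<T> field) {
            aggregators.add(new Aggregator<T>() {
                private double sum = 0.0;
                private long count = 0;
                @Override public void add(T row) { sum += field.applyAsDouble(row); count++; }
                @Override public double result() { return count == 0 ? Double.NaN : sum / count; }
            });
            return this;
        }

        /** Single use: the aggregators keep state, so create a new Builder per data set. */
        double[] over(Iterable<T> rows) {
            for (T row : rows) {
                for (Aggregator<T> aggregator : aggregators) {
                    aggregator.add(row);
                }
            }
            double[] out = new double[aggregators.size()];
            for (int i = 0; i < out.length; i++) {
                out[i] = aggregators.get(i).result();
            }
            return out;
        }
    }
}
```

With this sketch, the mapping from (2) would be written as {{new AggregateSketch.Builder<AggregateSketch.Row>().max(r -> r.a).min(r -> r.a).avg(r -> r.b).over(rows)}}. A wrong field reference fails at compile time rather than at runtime.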


> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>         Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, variance for org.apache.flink.api.java.aggregation.Aggregations.
> Ideally, the implementation should be single pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et al., International
> Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces the numerical
> errors that occur when adding a sequence of finite precision floating point numbers. Numerical
> errors arise due to truncation and rounding. These errors can lead to numerical instability
> when calculating variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
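
For illustration, the quoted requirement of a single-pass, numerically stable implementation could look roughly like the sketch below. It is an assumption-laden example, not Flink code and not the algorithm from the SystemML paper: it shows Kahan summation as described in the quoted Wikipedia text, and swaps in Welford's online update for mean and variance, plus the pairwise merge formula by Chan et al. so that partial results from parallel partitions can be combined. The names {{StableStats}}, {{kahanSum}}, and {{Running}} are hypothetical.

```java
// Hypothetical sketch only: not Flink's implementation and not the SystemML algorithm.
final class StableStats {

    /** Kahan (compensated) summation: feeds each addition's rounding error back into the next one. */
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0; // estimate of the low-order bits lost so far
        for (double value : values) {
            double corrected = value - compensation;
            double next = sum + corrected;
            compensation = (next - sum) - corrected;
            sum = next;
        }
        return sum;
    }

    /** Single-pass mean and variance using Welford's update. */
    static final class Running {
        private long count = 0;
        private double mean = 0.0;
        private double m2 = 0.0; // sum of squared deviations from the current mean

        void add(double value) {
            count++;
            double delta = value - mean;
            mean += delta / count;
            m2 += delta * (value - mean);
        }

        /** Combines two partial results without revisiting the data (Chan et al.). */
        Running merge(Running other) {
            Running out = new Running();
            out.count = count + other.count;
            if (out.count == 0) {
                return out;
            }
            double delta = other.mean - mean;
            out.mean = mean + delta * other.count / out.count;
            out.m2 = m2 + other.m2 + delta * delta * count * other.count / out.count;
            return out;
        }

        double mean() { return count == 0 ? Double.NaN : mean; }

        /** Sample variance (divides by n - 1); NaN for fewer than two values. */
        double variance() { return count < 2 ? Double.NaN : m2 / (count - 1); }

        double stdDev() { return Math.sqrt(variance()); }
    }
}
```

The {{merge}} method matters in a distributed setting because each partition can fold its own values with {{add}} and only the three-field state needs to be combined afterwards.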



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
