spark-issues mailing list archives

From "Uday Babbar (JIRA)" <>
Subject [jira] [Created] (SPARK-25911) [spark-ml] Hypothesis testing module
Date Thu, 01 Nov 2018 18:09:00 GMT
Uday Babbar created SPARK-25911:

             Summary: [spark-ml] Hypothesis testing module
                 Key: SPARK-25911
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Uday Babbar

h2. Why this ticket was created

This ticket is meant to assess the feasibility of adding some subset of a hypothesis testing
module, mainly in terms of its value proposition, and to get a preliminary opinion on how the
idea sounds in general. I can work on a more comprehensive proposal if, say, it is generally
agreed that including a DataFrame API for the t-test makes sense in the package.
h2. Current state

There are some streaming implementations in the o.a.s.mllib module, but there are no DataFrame
APIs for some standard tests (e.g. the t-test).
||Test||Current state||Proposed state||
|t-test (Welch's, Student's)|Streaming only|DataFrame API|
|Chi-squared|Streaming; DataFrame/RDD API present|-|
|ANOVA|-|DataFrame API|
|Mann-Whitney U test|-|RDD API (in maintenance mode, so probably doesn't make sense to include)|
h2. Rationale 

Experimentation platforms are widely used, and most of those that operate at scale (a large
portion of which use Spark for offline computation) require distributed implementations of
hypothesis tests to calculate p-values for different metrics/features. These APIs would
enable distributed computation of the relevant statistics and avoid the overhead of moving the
data (or some downstream view of it) to a framework where such computations are available
(R, SciPy).
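
As a rough illustration of why such tests distribute well: Welch's t-test needs only the count, mean, and sample variance of each group, which are simple aggregates. The sketch below is plain Python, not a proposed Spark API; the names GroupStats and welch_t are hypothetical, and it assumes the per-group statistics have already been aggregated elsewhere (for example by a DataFrame aggregation). It returns the t statistic and the Welch-Satterthwaite degrees of freedom, and leaves the p-value lookup out.

```python
import math
from typing import NamedTuple, Tuple


class GroupStats(NamedTuple):
    """Hypothetical container for one group's sufficient statistics."""
    n: int       # number of observations
    mean: float  # sample mean
    var: float   # unbiased sample variance (ddof = 1)


def welch_t(a: GroupStats, b: GroupStats) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group's mean.
    se_a = a.var / a.n
    se_b = b.var / b.n
    t = (a.mean - b.mean) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for unequal variances.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (a.n - 1) + se_b ** 2 / (b.n - 1))
    return t, df
```

Because the inputs are three numbers per group, only those aggregates would need to leave the cluster, not the underlying rows.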


