spark-issues mailing list archives

From "Peng Meng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
Date Wed, 12 Oct 2016 01:10:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567170#comment-15567170
] 

Peng Meng commented on SPARK-17870:
-----------------------------------

Hi [~avulanov], the question here is not whether to use raw chi2 scores or p-values. The point is
that if raw chi2 scores are used, the DoF must be the same for every feature.
"chi2-test is used multiple times" is a separate problem. According to (http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html), "whenever
a statistical test is used multiple times, then the probability of getting at least one error
increases." This problem is partially solved by selecting the p-values corresponding to the family-wise
error rate (SelectFwe, SPARK-17645). Thanks very much.
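
A rough sketch of the multiple-testing point, assuming independent tests that all use the same significance level alpha. The function names and the example p-values are hypothetical, and the Bonferroni-style threshold (alpha divided by the number of tests) is one common way to bound the family-wise error rate, not necessarily what SPARK-17645 implements.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across num_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def select_fwe(p_values, alpha):
    """Return indices of features whose p-value is below alpha / num_tests
    (a Bonferroni-style correction; assumed here for illustration)."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]


# Hypothetical p-values for three features.
example_p_values = [0.001, 0.02, 0.2]

single_test_rate = family_wise_error_rate(0.05, 1)
ten_test_rate = family_wise_error_rate(0.05, 10)
selected = select_fwe(example_p_values, 0.05)
```

With a single test the error rate equals alpha, while with ten tests it grows to roughly 40%. Under the corrected threshold only the first hypothetical feature is selected.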

Hi [~srowen], I totally agree with your comments. Because the DoF differs across the chi-square
values that Spark computes, we can use the p-values for Spark's SelectKBest and SelectPercentile. Thanks very much.
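
To illustrate why raw chi2 scores with different DoF cannot be ranked against each other, here is a small sketch in plain Python. The two features, their statistics, and their DoF are made-up values for illustration and are not taken from ChiSqSelectorSuite. chi2_sf is a hypothetical helper that evaluates the chi-square upper-tail probability for integer DoF.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable
    with integer degrees of freedom df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 2 (even) or df = 1 (odd).
    if df % 2 == 0:
        q, n = math.exp(-half), 2
    else:
        q, n = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(n + 2) = Q(n) + (x/2)^(n/2) * exp(-x/2) / Gamma(n/2 + 1).
    while n < df:
        q += half ** (n / 2.0) * math.exp(-half) / math.gamma(n / 2.0 + 1.0)
        n += 2
    return q


# Hypothetical features: name -> (chi-square statistic, degrees of freedom).
features = {"A": (6.0, 1), "B": (8.0, 4)}

p_values = {name: chi2_sf(stat, df) for name, (stat, df) in features.items()}

# Ranking by the raw statistic prefers the largest value.
best_by_statistic = max(features, key=lambda name: features[name][0])
# Ranking by p-value prefers the smallest value.
best_by_pvalue = min(p_values, key=p_values.get)
```

Feature B has the larger raw statistic, but because it has more degrees of freedom its p-value is larger than that of feature A, so the two ranking rules disagree. When every feature has the same DoF, both rules give the same order.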

I will submit a PR for this.

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> ------------------------------------------------------------------------
>
>                 Key: SPARK-17870
>                 URL: https://issues.apache.org/jira/browse/SPARK-17870
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>            Reporter: Peng Meng
>            Priority: Critical
>
> The method used to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala (line 233)
> is wrong.
> The feature selection method ChiSquareSelector selects features based on ChiSquareTestResult.statistic
> (the chi-square value): it selects the features with the largest chi-square values. But the
> degrees of freedom (df) of the chi-square values differ across features in Statistics.chiSqTest(RDD),
> and chi-square values with different df cannot be compared to select features.
> Because of the wrong method of computing the chi-square value, the feature selection results are
> strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 will be selected.
> If selectFpr is used: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> When selectKBest is used: feature 1 will be selected.
> When selectFpr is used: features 1 and 2 will be selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  




