spark-issues mailing list archives

From "Xinyong Tian (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
Date Wed, 06 Jun 2018 03:42:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502794#comment-16502794 ]

Xinyong Tian commented on SPARK-24431:
--------------------------------------

I read more about the first point of the PR curve:
https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/
In the above example, when the predicted probability for every row is set to 0.01, only one point
on the PR curve is defined, i.e. recall=1, precision=0.01. According to the website, the first point
on the PR curve should be obtained by extending a horizontal line from the second point (here the
only point, (1, 0.01)), which gives (0, 0.01). In this way, the no-model areaUnderPR = 0.01 instead of 0.05.
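
To make the arithmetic above concrete, here is a minimal sketch in plain Python (not Spark code). The helper name area_under_pr and the trapezoidal integration are assumptions made for illustration; the only convention taken from the comment is that the point at recall 0 reuses the precision of the first defined point.

```python
def area_under_pr(points):
    """Trapezoidal area under a PR curve.

    `points` is a list of (recall, precision) pairs sorted by increasing
    recall. A point at recall 0 is prepended that reuses the precision of
    the first defined point, i.e. the curve is extended horizontally.
    (Hypothetical helper for illustration, not Spark's implementation.)
    """
    curve = [(0.0, points[0][1])] + list(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(curve, curve[1:]):
        # Area of the trapezoid between two neighbouring points.
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# "No model": every row gets the same score, so only (recall=1, precision=0.01) is defined.
no_model = [(1.0, 0.01)]
print(area_under_pr(no_model))
```

With the single point (1, 0.01) the curve becomes a flat line at precision 0.01, so the area equals the event rate, 0.01.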

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---------------------------------------------------------------
>
>                 Key: SPARK-24431
>                 URL: https://issues.apache.org/jira/browse/SPARK-24431
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR'))
to select the best model. When regParam in the logistic regression is very high, no variable
is selected (no model), i.e. every row's prediction is the same, e.g. equal to the event rate (baseline
frequency). But at this point BinaryClassificationEvaluator reports the highest areaUnderPR.
As a result, the best model selected is a no model.
> The reason is as follows. With no model, the precision-recall curve has only two
points: at recall=0, precision should be set to zero, while the software sets it to 1;
at recall=1, precision is the event rate. As a result, areaUnderPR is close to 0.5
(my event rate is very low), which is the maximum.
> The solution is to set precision=0 when recall=0.
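
The two-point curve in the description can be checked by hand. The sketch below is plain Python with a hypothetical helper, two_point_area, and assumes trapezoidal integration between (0, precision at recall 0) and (1, event rate). It compares the behaviour described in the report (precision 1 at recall 0) with the proposed fix (precision 0 at recall 0). The event rates are illustrative values, not data from the report.

```python
def two_point_area(event_rate, precision_at_zero_recall):
    # Trapezoid between (0, precision_at_zero_recall) and (1, event_rate).
    # Hypothetical helper for illustration, not part of Spark.
    return (precision_at_zero_recall + event_rate) / 2.0


for event_rate in (0.01, 0.05, 0.2):
    reported = two_point_area(event_rate, 1.0)  # precision set to 1 at recall 0
    proposed = two_point_area(event_rate, 0.0)  # precision set to 0 at recall 0
    print(event_rate, reported, proposed)
```

Under the reported behaviour the area is (1 + event rate) / 2, which stays just above 0.5 however low the event rate is. Under the proposed fix it is event rate / 2.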


