spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics
Date Wed, 31 Dec 2014 21:38:13 GMT

     [ https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-4547.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

Issue resolved by pull request 3702
[https://github.com/apache/spark/pull/3702]

> OOM when making bins in BinaryClassificationMetrics
> ---------------------------------------------------
>
>                 Key: SPARK-4547
>                 URL: https://issues.apache.org/jira/browse/SPARK-4547
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 1.3.0
>
>
> This also follows up on http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoXUxFpMA@mail.gmail.com%3E
> I intend to open a PR for this one a bit later. The conversation was basically:
> {quote}
> Recently I was using BinaryClassificationMetrics to build an AUC curve for a classifier
> over a reasonably large number of points (~12M). The scores were all probabilities, so
> they tended to be almost entirely unique.
> The computation does some operations by key, and this ran out of memory. It's something
> you can solve with more than the default amount of memory, but in this case it did not
> seem useful to create an AUC curve with such fine-grained resolution.
> I ended up just binning the scores so there were ~1000 unique values, and then it was fine.
> {quote}
> and:
> {quote}
> Yes, if there are many distinct values, we need binning to compute the AUC curve. Usually
> the scores are not evenly distributed, so we cannot simply truncate the digits. Estimating
> the quantiles for binning is necessary, similar to RangePartitioner:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
> Limiting the number of bins is definitely useful.
> {quote}
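
The binning discussed in the quoted conversation can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from pull request 3702: the helper names (`quantile_edges`, `bin_scores`, `auc`), the sample size, and the synthetic data are all hypothetical. It estimates quantile cut points from a random sample (the same general idea as the sampling in RangePartitioner), maps each score to its bin index, and computes AUC by grouping on the score key, so the number of keys drops from the number of distinct scores to at most `num_bins`.

```python
import bisect
import random


def quantile_edges(scores, num_bins, sample_size=10000, seed=0):
    """Estimate up to num_bins - 1 cut points from a random sample of scores."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(scores, min(sample_size, len(scores))))
    # Take cut points at evenly spaced ranks of the sorted sample.
    edges = [sample[(i * len(sample)) // num_bins] for i in range(1, num_bins)]
    return sorted(set(edges))


def bin_scores(scores, edges):
    """Replace each score with its bin index; score order is preserved."""
    return [bisect.bisect_right(edges, s) for s in scores]


def auc(scores, labels):
    """Area under the ROC curve, with tied scores counted as half a win."""
    # One (positives, negatives) counter per distinct score: this is the
    # by-key step whose size grows with the number of unique scores.
    counts = {}
    for s, y in zip(scores, labels):
        pos_neg = counts.setdefault(s, [0, 0])
        pos_neg[0 if y else 1] += 1
    total_pos = sum(c[0] for c in counts.values())
    total_neg = sum(c[1] for c in counts.values())
    neg_below = 0
    area = 0.0
    for s in sorted(counts):
        pos, neg = counts[s]
        area += pos * (neg_below + 0.5 * neg)
        neg_below += neg
    return area / (total_pos * total_neg)


# Synthetic, skewed scores (assumption: labels are drawn with P(y=1) = score).
rng = random.Random(42)
scores = [rng.random() ** 4 for _ in range(20000)]
labels = [rng.random() < s for s in scores]

edges = quantile_edges(scores, num_bins=100)
binned = bin_scores(scores, edges)

exact_auc = auc(scores, labels)    # roughly one key per point
binned_auc = auc(binned, labels)   # at most 100 keys
print(len(set(scores)), len(set(binned)), exact_auc, binned_auc)
```

Because bin indices preserve score order, binning only introduces ties within a bin. With quantile cut points each bin holds roughly the same share of points however skewed the scores are, which is why truncating digits is not a substitute when the scores are unevenly distributed.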



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

