spark-issues mailing list archives

From "Vladimir Feinberg (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml
Date Tue, 13 Sep 2016 01:24:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Feinberg updated SPARK-16728:
--------------------------------------
    Description: 
Currently, spark.ml trees rely on spark.mllib implementations. There are three issues with this:

1. spark.ml's GBT TreeBoost algorithm requires storing additional information (the previous
ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based
splits for complex loss functions).
2. The old impurity API only lets you use summary statistics up to the 2nd order. These are
useless for several impurity measures and inadequate for others (e.g., absolute loss or Huber
loss). The API needs some renovation.
3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, and Impurity into
a single class (and use virtual calls rather than case statements when switching on impurity
type).
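
To make points 2 and 3 concrete, here is a minimal sketch in Python (not Spark's Scala code).
Every name in it (Impurity, Variance, AbsoluteLoss, node_impurity) is hypothetical and is not
part of the spark.ml or spark.mllib API. It assumes that "absolute loss" impurity means the mean
absolute deviation from the median. Each measure is a single class that both aggregates labels
and computes the impurity, and the caller dispatches through a virtual method instead of
matching on the impurity type.

```python
from abc import ABC, abstractmethod
from statistics import median


class Impurity(ABC):
    """Hypothetical unified class: aggregates labels and computes the impurity."""

    @abstractmethod
    def add(self, label: float) -> None:
        """Fold one label into the running statistics."""

    @abstractmethod
    def value(self) -> float:
        """Return the impurity of the labels seen so far."""


class Variance(Impurity):
    """Needs only count, sum, and sum of squares (statistics up to 2nd order)."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, label: float) -> None:
        self.count += 1
        self.total += label
        self.total_sq += label * label

    def value(self) -> float:
        if self.count == 0:
            return 0.0
        mean = self.total / self.count
        return self.total_sq / self.count - mean * mean


class AbsoluteLoss(Impurity):
    """Assumed definition: mean absolute deviation from the median.

    The median is not determined by count, sum, and sum of squares,
    so this sketch simply keeps every label.
    """

    def __init__(self) -> None:
        self.labels = []

    def add(self, label: float) -> None:
        self.labels.append(label)

    def value(self) -> float:
        if not self.labels:
            return 0.0
        m = median(self.labels)
        return sum(abs(y - m) for y in self.labels) / len(self.labels)


def node_impurity(impurity: Impurity, labels) -> float:
    """Virtual dispatch: no case statement over impurity types."""
    for y in labels:
        impurity.add(y)
    return impurity.value()


# Two label sets with identical count (4), sum (6), and sum of squares (18).
a = [0.0, 0.0, 3.0, 3.0]
b = [-1.0, 2.0, 2.0, 3.0]
print(node_impurity(Variance(), a), node_impurity(Variance(), b))
print(node_impurity(AbsoluteLoss(), a), node_impurity(AbsoluteLoss(), b))
```

The two label sets share the same 2nd-order summary statistics, so any impurity computed from
those statistics alone cannot tell them apart, yet their absolute-loss impurity differs. Keeping
every label is only the simplest way to show this; a real aggregator would likely need a more
compact representation.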


  was:
Currently, spark.ml trees rely on spark.mllib implementations. There are two issues with this:

1. spark.ml's GBT TreeBoost algorithm requires storing additional information (the previous
ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based
splits for complex loss functions).
2. The old impurity API only lets you use summary statistics up to the 2nd order. These are
useless for several impurity measures and inadequate for others (e.g., absolute loss or Huber
loss). The API needs some renovation.


> migrate internal API for MLlib trees from spark.mllib to spark.ml
> -----------------------------------------------------------------
>
>                 Key: SPARK-16728
>                 URL: https://issues.apache.org/jira/browse/SPARK-16728
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Vladimir Feinberg
>
> Currently, spark.ml trees rely on spark.mllib implementations. There are three issues with
> this:
> 1. spark.ml's GBT TreeBoost algorithm requires storing additional information (the previous
> ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based
> splits for complex loss functions).
> 2. The old impurity API only lets you use summary statistics up to the 2nd order. These
> are useless for several impurity measures and inadequate for others (e.g., absolute loss or
> Huber loss). The API needs some renovation.
> 3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, and Impurity
> into a single class (and use virtual calls rather than case statements when switching on
> impurity type).




