spark-issues mailing list archives

From "Chris T (JIRA)" <>
Subject [jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
Date Tue, 27 Jan 2015 22:43:35 GMT


Chris T commented on SPARK-5436:

The usual way that GBT models are evaluated is by calculating an error metric on a hold-out
test/validation data set. The error metric is often something simple, like Mean Squared Error.
Plotting MSE against the number of trees in the model typically shows the following pattern:

In the early stages, the model's predictions improve. Once the model passes the optimal number
of trees, the predictions degrade because the model overfits. At the moment, one way to obtain
this information is to extract the trees from the model (GradientBoostedTreesModel.trees
returns an Array of DecisionTreeModel) and iteratively rebuild a sub-model for each tree count,
scoring the test data against each sub-model. This is fairly expensive.
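
As a rough illustration of a cheaper alternative, the sketch below scores every prefix of the
ensemble in a single pass by keeping a running sum of weighted tree predictions instead of
rebuilding each sub-model. It is plain Python rather than Spark code, and it assumes that the
ensemble prediction is a weighted sum of the individual tree predictions. The function name
mse_per_tree_count and the callables standing in for trees are hypothetical.

```python
def mse_per_tree_count(trees, weights, features, labels):
    """Return a list whose k-th entry is the MSE of the first k+1 trees.

    Assumption: the ensemble prediction is sum(weight * tree(x)).
    `trees` are hypothetical callables mapping a feature value to a prediction.
    """
    if len(features) != len(labels) or not labels:
        raise ValueError("features and labels must be non-empty and equal in length")

    running = [0.0] * len(labels)  # cumulative prediction for each test row
    errors = []
    for tree, weight in zip(trees, weights):
        # Add only the newest tree's contribution; earlier trees are already summed.
        for i, x in enumerate(features):
            running[i] += weight * tree(x)
        squared = sum((p - y) ** 2 for p, y in zip(running, labels))
        errors.append(squared / len(labels))
    return errors


# Toy usage with two stand-in "trees" (not real DecisionTreeModel objects).
toy_trees = [lambda x: x, lambda x: 1.0]
toy_weights = [1.0, 0.5]
curve = mse_per_tree_count(toy_trees, toy_weights, [0.0, 1.0], [0.5, 1.5])
```

Each tree is evaluated once per test row, so the cost grows linearly with the number of trees
rather than quadratically.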

Is there a model error metric that is calculated internally (e.g. by the gradient descent
algorithm)? If it were retained, I think it would be very valuable. Ideally, the model error
would be retained for each tree during the build phase. It would then be fairly trivial
to create a sub-model that yields optimal predictions.
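
If such per-tree errors were available, picking the sub-model could look like the sketch below.
It is plain Python, not Spark's API. best_tree_count and truncate_ensemble are hypothetical
helpers, and the error values in the usage are made up for illustration.

```python
def best_tree_count(errors):
    """Return the 1-based number of trees with the lowest validation error.

    On ties, the smallest tree count wins, which favours the simpler model.
    """
    if not errors:
        raise ValueError("errors must not be empty")
    best_index = min(range(len(errors)), key=errors.__getitem__)
    return best_index + 1


def truncate_ensemble(trees, weights, count):
    """Keep only the first `count` trees and their weights."""
    if not 1 <= count <= len(trees):
        raise ValueError("count must be between 1 and the number of trees")
    return trees[:count], weights[:count]


# Made-up error curve: improves up to the third tree, then degrades.
example_errors = [0.9, 0.4, 0.3, 0.35, 0.5]
count = best_tree_count(example_errors)  # 3
```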

> Validate GradientBoostedTrees during training
> ---------------------------------------------
>                 Key: SPARK-5436
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> For Gradient Boosting, it would be valuable to compute test error on a separate validation
> set during training.  That way, training could stop early based on the test error (or some
> other metric specified by the user).
