spark-user mailing list archives

From: Christopher Thom
Subject: [MLlib] Scoring GBTs with a variable number of trees
Date: Wed, 07 Jan 2015 23:01:42 GMT

Hi All,

I wonder if anyone has any experience with building Gradient Boosted Tree models in MLlib,
and can help me out. I'm trying to create a plot of the test error rate of a Gradient Boosted
Tree model as a function of number of trees, to determine the optimal number of trees in the
model. Does Spark calculate (and store!) the error rate on each iteration of model building?
Can I get at those values somehow? Alternatively, having constructed a model, is it possible
to score with only a fixed number of trees? e.g. I built a model with 1000 trees, but I only
want to score the data with the first 100 trees. I could calculate the needed quantities by
hand if I could do that in some way.
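
To make the "score with only the first N trees" idea concrete, here is a minimal sketch in plain Python. It assumes a boosted regression ensemble predicts with a weighted sum of its trees' outputs, so a truncated score is just a partial sum. The trees here are toy callables and the names (predict_with_first_n, toy_trees, toy_weights) are hypothetical stand-ins, not MLlib API.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a fitted tree: any function from features to a number.
Tree = Callable[[Sequence[float]], float]


def predict_with_first_n(trees: List[Tree], weights: List[float],
                         features: Sequence[float], n: int) -> float:
    """Score one example using only the first n trees of the ensemble."""
    if not 0 <= n <= len(trees):
        raise ValueError("n must be between 0 and the number of trees")
    # Weighted sum over the truncated ensemble.
    return sum(w * tree(features) for tree, w in zip(trees[:n], weights[:n]))


# Toy ensemble (assumed values, for illustration only).
toy_trees: List[Tree] = [lambda x: 1.0, lambda x: x[0], lambda x: -0.5]
toy_weights = [1.0, 0.1, 0.1]

full_score = predict_with_first_n(toy_trees, toy_weights, [2.0], 3)
truncated_score = predict_with_first_n(toy_trees, toy_weights, [2.0], 2)
```

If the fitted model exposes its individual trees and their weights, the same partial sum could be computed from those.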

The optimal number of trees in a GBM is typically determined by calculating the mean squared
error (MSE) on each iteration when building the model. The final model is then considered "optimal"
when the MSE is at its minimum, i.e. in a plot of MSE vs. number of trees, the error will decrease
(as the model improves), hit a minimum (the optimal point), and then increase (as the model
starts to overfit the data).
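
The curve described above can be sketched as follows, again in plain Python with toy numbers. It assumes the per-tree predictions on a held-out set are already available as lists, and that the ensemble prediction is a weighted running sum. The function names and data are hypothetical, not taken from any library.

```python
from typing import List


def mse_by_tree_count(per_tree_preds: List[List[float]],
                      weights: List[float],
                      labels: List[float]) -> List[float]:
    """Return MSE after 1, 2, ..., T trees.

    per_tree_preds[i][j] is the output of tree i on example j.
    """
    running = [0.0] * len(labels)  # ensemble prediction so far, per example
    curve = []
    for preds, w in zip(per_tree_preds, weights):
        # Add this tree's weighted contribution to every example.
        running = [r + w * p for r, p in zip(running, preds)]
        squared_errors = [(r - y) ** 2 for r, y in zip(running, labels)]
        curve.append(sum(squared_errors) / len(labels))
    return curve


def best_tree_count(curve: List[float]) -> int:
    """Number of trees (1-based) at which the error curve is lowest."""
    return min(range(len(curve)), key=curve.__getitem__) + 1


# Toy held-out data: three trees, two examples (assumed values).
labels = [1.0, 2.0]
per_tree_preds = [[1.0, 1.0], [0.0, 1.0], [1.0, -1.0]]
weights = [1.0, 1.0, 1.0]

curve = mse_by_tree_count(per_tree_preds, weights, labels)
best = best_tree_count(curve)
```

Keeping a running sum means each tree is evaluated once, rather than rescoring the data separately for every candidate tree count.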

Christopher Thom
Level 25, 8 Chifley, 8-12 Chifley Square
Sydney NSW 2000

T: +61 2 8222 3577
F: +61 2 9292 6444



