spark-issues mailing list archives

From "zhengruifeng (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-30381) GBT reuse splits for all trees
Date Sun, 29 Dec 2019 11:05:00 GMT
zhengruifeng created SPARK-30381:
------------------------------------

             Summary: GBT reuse splits for all trees
                 Key: SPARK-30381
                 URL: https://issues.apache.org/jira/browse/SPARK-30381
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng
            Assignee: zhengruifeng


In the existing GBT implementation, each tree first computes the available splits of each feature (via
RandomForest.findSplits), based on the dataset sampled at that iteration. It then uses these
splits to discretize the input vectors into BaggedPoint[TreePoint]s. These BaggedPoints (the same
size as the input vectors) are cached and used only at that iteration. Note that the splits used for
discretization differ from tree to tree (when subsamplingRate < 1) only because the sampled vectors
differ.

However, the splits at different iterations should be similar if the sampled dataset is large enough,
and exactly the same if subsamplingRate = 1.
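
To make the current flow concrete, here is a schematic sketch (simplified pseudo-Scala, not the
literal Spark code; RandomForest.findSplits, TreePoint.convertToTreeRDD and
BaggedPoint.convertToBaggedRDD are the internal helpers in org.apache.spark.ml.tree.impl, and
relabel is a hypothetical placeholder for the pseudo-residual step):
{code:scala}
// Schematic view of the current behavior: every boosting iteration
// re-derives the splits from its own data.
for (m <- 0 until numIterations) {
  val data = relabel(input, predError)  // hypothetical: attach pseudo-residual labels

  // the splits are recomputed from this iteration's data, so they can
  // differ between iterations when subsamplingRate < 1
  val splits = RandomForest.findSplits(data, metadata, seed + m)

  // discretize into TreePoints, bag them, and cache a dataset of the same
  // size as the input, used only for this single tree
  val treePoints = TreePoint.convertToTreeRDD(data, splits, metadata)
  val bagged = BaggedPoint.convertToBaggedRDD(
    treePoints, subsamplingRate, numSubsamples = 1, withReplacement = false, seed + m)
  bagged.persist(StorageLevel.MEMORY_AND_DISK)

  // ... grow the m-th tree on `bagged` ...
  bagged.unpersist()
}
{code}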


However, other well-known GBT implementations with binned features (such as XGBoost and LightGBM)
use the same splits for discretization at every iteration:
{code:java}
import xgboost as xgb
from sklearn.datasets import load_svmlight_file

# load the sample data and keep only the first two features
X, y = load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt')
dtrain = xgb.DMatrix(X[:, :2], label=y)

num_round = 3
# 'tree_method': 'hist' enables binned (histogram-based) training;
# the original snippet set 'eta' twice, so only the second value (0.01) took effect
param = {'max_depth': 2, 'eta': 0.01, 'objective': 'reg:squarederror',
         'tree_method': 'hist', 'max_bin': 2, 'subsample': 0.5}
bst = xgb.train(param, dtrain, num_round)
bst.trees_to_dataframe()
Out[61]: 
    Tree  Node   ID Feature     Split  Yes   No Missing        Gain  Cover
0      0     0  0-0      f1  0.000408  0-1  0-2     0-1  170.337143  256.0
1      0     1  0-1      f0  0.003531  0-3  0-4     0-3   44.865482  121.0
2      0     2  0-2      f0  0.003531  0-5  0-6     0-5  125.615570  135.0
3      0     3  0-3    Leaf       NaN  NaN  NaN     NaN   -0.010050   67.0
4      0     4  0-4    Leaf       NaN  NaN  NaN     NaN    0.002126   54.0
5      0     5  0-5    Leaf       NaN  NaN  NaN     NaN    0.020972   69.0
6      0     6  0-6    Leaf       NaN  NaN  NaN     NaN    0.001714   66.0
7      1     0  1-0      f0  0.003531  1-1  1-2     1-1   50.417793  263.0
8      1     1  1-1      f1  0.000408  1-3  1-4     1-3   48.732742  124.0
9      1     2  1-2      f1  0.000408  1-5  1-6     1-5   52.832161  139.0
10     1     3  1-3    Leaf       NaN  NaN  NaN     NaN   -0.012784   63.0
11     1     4  1-4    Leaf       NaN  NaN  NaN     NaN   -0.000287   61.0
12     1     5  1-5    Leaf       NaN  NaN  NaN     NaN    0.008661   64.0
13     1     6  1-6    Leaf       NaN  NaN  NaN     NaN   -0.003624   75.0
14     2     0  2-0      f1  0.000408  2-1  2-2     2-1   62.136013  242.0
15     2     1  2-1      f0  0.003531  2-3  2-4     2-3  150.537781  118.0
16     2     2  2-2      f0  0.003531  2-5  2-6     2-5    3.829046  124.0
17     2     3  2-3    Leaf       NaN  NaN  NaN     NaN   -0.016737   65.0
18     2     4  2-4    Leaf       NaN  NaN  NaN     NaN    0.005809   53.0
19     2     5  2-5    Leaf       NaN  NaN  NaN     NaN    0.005251   60.0
20     2     6  2-6    Leaf       NaN  NaN  NaN     NaN    0.001709   64.0
 {code}

We can see that even with subsample=0.5, all three trees share exactly the same splits (f0 always
splits at 0.003531 and f1 at 0.000408).


So I think we could reuse the splits and the TreePoints across all iterations:

At iteration 0, compute the splits on the whole training dataset, and use these splits to generate
the TreePoints.

At each subsequent iteration, generate the BaggedPoints directly from the cached TreePoints.

This way we no longer need to persist/unpersist an internal training dataset for each tree.
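
A minimal sketch of the proposed flow, with the same caveats as above (schematic pseudo-Scala, not a
patch; the per-iteration pseudo-residual labels would still have to be carried alongside the cached
TreePoints):
{code:scala}
// Proposed: derive the splits and TreePoints once from the whole training
// set, cache them, and only re-bag at each iteration.
val splits = RandomForest.findSplits(input, metadata, seed)           // computed once
val treePoints = TreePoint.convertToTreeRDD(input, splits, metadata)  // discretized once
treePoints.persist(StorageLevel.MEMORY_AND_DISK)                      // shared by all trees

for (m <- 0 until numIterations) {
  // only the bagging changes per iteration; there is no per-tree
  // persist/unpersist of a full-size discretized dataset
  val bagged = BaggedPoint.convertToBaggedRDD(
    treePoints, subsamplingRate, numSubsamples = 1, withReplacement = false, seed + m)
  // ... grow the m-th tree on `bagged`, using this iteration's pseudo-residual labels ...
}

treePoints.unpersist()
{code}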
