spark-user mailing list archives

From OBones <obo...@free.fr>
Subject [ML] RandomForestRegressor training set size for each tree
Date Mon, 05 Mar 2018 11:07:23 GMT
We are using RandomForestRegressor from Spark 2.1.1 to train a model.

To make sure we have the appropriate parameters, we start with a very
small dataset, one that has 6024 rows. The regressor is created with
this code:

    val rf = new RandomForestRegressor()
      .setLabelCol("MyLabel")
      .setFeaturesCol("MyFeatures")
      .setImpurity("variance")
      .setMaxDepth(3)
      .setMinInstancesPerNode(1)
      .setMinInfoGain(0)
      .setNumTrees(2)
      .setFeatureSubsetStrategy("onethird")
      .setMaxBins(32)
      .setSubsamplingRate(1)

    val model = rf.fit(train)

Using the debugger I can observe the ImpurityStats for each rootNode
of each DecisionTreeModel inside the trees array. The stat I am
interested in is the first entry of the stats array, because it is the
number of rows the node was trained on.

What I find strange is that this value for each rootNode is not always
6024; it is sometimes more and sometimes less.
From my understanding of the method, I was under the impression that
each tree would be trained with exactly the same number of rows as the
original training set.

Looking at the source code, I could not fully figure out where this 
happens, nor why it was decided to do so.
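
In case it clarifies what I mean, here is a minimal standalone sketch of
what I suspect is happening, based on my (possibly wrong) reading of
BaggedPoint in org.apache.spark.ml.tree.impl: when sampling with
replacement, each row seems to receive a Poisson(subsamplingRate)
weight for each tree, so the effective per-tree row count fluctuates
around the dataset size rather than matching it exactly:

    import org.apache.commons.math3.distribution.PoissonDistribution

    val n = 6024              // size of my training set
    val subsamplingRate = 1.0
    val poisson = new PoissonDistribution(subsamplingRate)
    poisson.reseedRandomGenerator(42L)

    // One Poisson weight per row for a single tree; the sum is the
    // effective number of rows that tree is trained on.
    val weights = Array.fill(n)(poisson.sample())
    println(weights.sum)      // close to 6024, but rarely exactly 6024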

Are there any resources discussing this behavior?

