From Reed Villanueva <>
Subject What happens if a random forest max bins is set too high?
Date Tue, 15 Jun 2021 06:50:43 GMT
What happens if a random forest "max bins" hyperparameter is set too high?

When training a Spark ML random forest with maxBins set roughly equal to
the largest number of distinct categorical values for any single feature, I
see OK performance metrics. But when I set it closer to 2x or 3x that
number, performance is terrible: accuracy (in the case of a binary
classifier) is no better than what the actual distribution of responses in
the dataset would give, and the feature importances are all zeros (whereas
with the lower initial maxBins value they at least show *something*).
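
To make the "no better than the distribution of responses" comparison
concrete, here is a minimal sketch of the majority-class baseline: the
accuracy obtained by always predicting the most frequent label. The function
name and the example labels are made up for illustration and are not taken
from the actual dataset or from Spark.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical binary responses: 7 zeros and 3 ones.
example_labels = [0] * 7 + [1] * 3
baseline = majority_baseline_accuracy(example_labels)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

A classifier whose accuracy sits at this value has most likely learned
nothing beyond the label frequencies.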

I would not think that there would be such a huge difference just from a
change in max bins like this (esp. the difference in seeing *something* vs
absolutely nothing / all zeros for the feature importances).

What could be happening under the hood of the algo that causes such
different outcomes when this parameter is changed like this?

