spark-user mailing list archives

From Reed Villanueva <villanuevar...@gmail.com>
Subject What happens if a random forest max bins is set too high?
Date Tue, 15 Jun 2021 06:50:43 GMT
What happens if a random forest "max bins" hyperparameter is set too high?

When training a Spark ML random forest (
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
) with maxBins set roughly equal to the largest number of distinct values
of any categorical feature, I see OK performance metrics. But when I set
it closer to 2x or 3x that number, performance is terrible: accuracy (in
the case of a binary classifier) is no better than a baseline that just
follows the actual distribution of responses in the dataset, and the
feature importances (
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.RandomForestClassificationModel.featureImportances
) are all zeros. With the lower initial maxBins value, the importances at
least show *something*.

I would not think that there would be such a huge difference just from a
change in max bins like this (esp. the difference in seeing *something* vs
absolutely nothing / all zeros for the feature importances).

What could be happening under the hood of the algo that causes such
different outcomes when this parameter is changed like this?
