spark-user mailing list archives

From Mark Alen <>
Subject [mllib] Random forest maxBins and confidence in training points
Date Tue, 18 Aug 2015 23:54:52 GMT
Hi everyone, 
I have two questions regarding the random forest implementation in MLlib:
1- maxBins: Say the values of a feature lie in [0,100]. In my dataset there are a lot of
data points in [0,10], one data point at 100, and nothing in (10,100). How does the
binning work in this case? I obviously don't want all the points in [0,10] to fall into
the same bin while the other bins stay empty. Does MLlib do any smart reallocation of bins
so that each bin gets some data points, rather than one bin getting nearly all of them?
2- Is there any way to do this in Spark?
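
To make question 1 concrete, here is a minimal plain-Python sketch of the two behaviours I am asking about. It is not MLlib code and makes no claim about what MLlib actually does. The data, the helper names (equal_width_edges, quantile_edges, bin_counts) and max_bins = 4 are all made up for illustration.

```python
# Hypothetical data matching the scenario: many values in [0, 10], one at 100.
values = [i / 10 for i in range(101)] + [100.0]  # 0.0, 0.1, ..., 10.0, 100.0
max_bins = 4  # illustrative stand-in for maxBins, not a recommended setting


def equal_width_edges(xs, n_bins):
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def quantile_edges(xs, n_bins):
    """Pick edges from the sorted data so each bin holds about the same count."""
    s = sorted(xs)
    return [s[len(s) * k // n_bins] for k in range(1, n_bins)]


def bin_counts(xs, edges):
    """Count points per bin; a point goes to the bin after every edge it reaches."""
    counts = [0] * (len(edges) + 1)
    for x in xs:
        idx = sum(1 for e in edges if x >= e)
        counts[idx] += 1
    return counts


equal_width = bin_counts(values, equal_width_edges(values, max_bins))
quantile = bin_counts(values, quantile_edges(values, max_bins))

print("equal-width bins:", equal_width)  # nearly everything lands in the first bin
print("quantile bins:   ", quantile)     # points are spread roughly evenly
```

With equal-width bins the two middle bins stay empty, which is the case I would like to avoid. The quantile variant is the kind of "smart reallocation" I have in mind.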
Thanks a lot,
Mark
