spark-dev mailing list archives

From Joseph Bradley <jos...@databricks.com>
Subject Re: Different maxBins value for categorical and continuous features in RandomForest implementation.
Date Wed, 13 Apr 2016 03:05:09 GMT
That sounds useful.  Would you mind creating a JIRA for it?  Thanks!
Joseph

On Mon, Apr 11, 2016 at 2:06 AM, Rahul Tanwani <tanwanirahul@gmail.com>
wrote:

> Hi,
>
> Currently, the RandomForest algorithm takes a single maxBins value to decide
> the number of splits to consider for every feature. Training time can become
> very high when a single categorical column has a large number of unique
> values, because maxBins must be raised to accommodate it. That one column
> then affects all the numeric (continuous) columns, even though they do not
> need such a high number of splits.
>
> Encoding the categorical column into separate features makes the data very
> wide, which requires us to increase maxMemoryInMB and puts more pressure on
> the GC as well.
>
> Keeping separate maxBins values for categorical and continuous features
> should be useful in this regard.
>
>
>
>
>
>
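
For anyone following the thread, here is a rough, dependency-free sketch of the effect described above. It is not Spark code. It assumes a simplified model in which a categorical feature uses one bin per category, a continuous feature uses up to maxBins bins, and maxBins must be at least the largest category count. The function names and the two separate parameters (max_bins_categorical, max_bins_continuous) are hypothetical and only illustrate the proposal.

```python
from typing import Dict, List


def bins_single(arities: Dict[int, int], num_features: int, max_bins: int) -> List[int]:
    """Bins per feature when one max_bins value is shared by all features.

    arities maps a categorical feature index to its number of categories;
    every feature index not in the map is treated as continuous.
    """
    largest = max(arities.values(), default=0)
    if largest > max_bins:
        raise ValueError(
            f"max_bins ({max_bins}) must be >= largest category count ({largest})"
        )
    # Categorical: one bin per category. Continuous: up to max_bins bins.
    return [arities.get(i, max_bins) for i in range(num_features)]


def bins_separate(
    arities: Dict[int, int],
    num_features: int,
    max_bins_categorical: int,
    max_bins_continuous: int,
) -> List[int]:
    """Bins per feature under the proposal (hypothetical parameters)."""
    largest = max(arities.values(), default=0)
    if largest > max_bins_categorical:
        raise ValueError(
            f"max_bins_categorical ({max_bins_categorical}) must be >= "
            f"largest category count ({largest})"
        )
    # Continuous features are no longer tied to the categorical limit.
    return [arities.get(i, max_bins_continuous) for i in range(num_features)]


# One categorical column with 1000 categories plus nine continuous columns.
arities = {0: 1000}
single = bins_single(arities, num_features=10, max_bins=1000)
separate = bins_separate(
    arities, num_features=10, max_bins_categorical=1000, max_bins_continuous=32
)

print("total bins, single maxBins:  ", sum(single))
print("total bins, separate maxBins:", sum(separate))
```

Under these assumptions, the shared setting forces all nine continuous columns to 1000 bins each, while the separate setting keeps them at 32. The real bin counts inside Spark depend on details this sketch ignores, such as the number of distinct values per column.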
