spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: why decision trees do binary split?
Date Thu, 06 Nov 2014 09:46:45 GMT
I haven't seen that done before, which may be most of the reason. I am not
sure it is common practice.

I can see upsides: for a categorical feature with N values, you need not
pick candidate splits to test, since there is only one N-way rule possible.
The binary-split equivalent takes on the order of N levels instead of 1.

The big problem is that you always partition the data set completely,
making the equivalent of all N binary rules, even where you would not
otherwise bother because they add no information about the target. The
subsets reaching each child are therefore unnecessarily small, and this
makes learning on each independent subset weaker.
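
To make the fragmentation point concrete, here is a minimal sketch in plain Python. It is not Spark's implementation. The toy data, the function names, and the choice of Gini impurity are all assumptions for illustration, and the binary search relies on the usual trick of ordering categories by their fraction of positive labels, which only applies to 0/1 targets. Only category "a" tells you anything about the label; "b", "c" and "d" are uninformative.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(groups):
    """Size-weighted average impurity over the children of a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


def group_by_value(rows):
    """Map each categorical value to the labels of the rows that carry it."""
    by_value = defaultdict(list)
    for value, label in rows:
        by_value[value].append(label)
    return by_value


def n_way_split(rows):
    """One child per categorical value (the N-way rule)."""
    return list(group_by_value(rows).values())


def best_binary_split(rows):
    """Best two-child split, assuming 0/1 labels.

    Categories are ordered by their fraction of 1-labels and only the
    cut points along that ordering are tested.
    """
    by_value = group_by_value(rows)
    ordered = sorted(by_value, key=lambda v: sum(by_value[v]) / len(by_value[v]))
    best = None
    for i in range(1, len(ordered)):
        left = [lab for v in ordered[:i] for lab in by_value[v]]
        right = [lab for v in ordered[i:] for lab in by_value[v]]
        score = weighted_gini([left, right])
        if best is None or score < best[0]:
            best = (score, left, right)
    return [best[1], best[2]]


# Hypothetical toy data: 10 rows per category, labels are 0 or 1.
rows = [("a", 1)] * 10
for cat in ("b", "c", "d"):
    rows += [(cat, 1)] * 5 + [(cat, 0)] * 5

n_way = n_way_split(rows)
binary = best_binary_split(rows)

print("N-way child sizes: ", sorted(len(g) for g in n_way))
print("binary child sizes:", sorted(len(g) for g in binary))
print("N-way impurity: ", weighted_gini(n_way))
print("binary impurity:", weighted_gini(binary))
```

On this data both splits reach the same weighted impurity, but the binary split leaves the three uninformative categories pooled in one child of 30 rows, while the N-way split cuts them into three children of 10 rows each that every later split has to learn from separately.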
On Nov 6, 2014 9:36 AM, "jamborta" <jamborta@gmail.com> wrote:

> I meant above that, in the case of categorical variables, it might be more
> efficient to create a node for each categorical value. Is there a reason why
> Spark went down the binary route?
>
> thanks,
