spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (SPARK-5688) Splits for Categorical Variables in DecisionTrees
Date Mon, 09 Feb 2015 16:22:34 GMT


Apache Spark commented on SPARK-5688:

User 'edenovit' has created a pull request for this issue:

> Splits for Categorical Variables in DecisionTrees
> -------------------------------------------------
>                 Key: SPARK-5688
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Any
>            Reporter: Eric Denovitzer
>              Labels: categorical, decisiontree
>             Fix For: 1.2.0
> The categories in each subset chosen to build a split on a categorical variable are not random. Each subset is determined by the binary representation of an integer from 1 to (2^(number of categories)) - 2 (this range excludes the empty and full subsets). In the current implementation, the integers used for the subsets are 1..numSplits. They should be drawn at random instead, so that the splits are not biased towards the categories with the lower indexes.
> Another problem is that if numBins/2 is bigger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise the same subsets might be considered more than once when choosing the splits).
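
For illustration only, here is a minimal Python sketch of the behaviour the description asks for. It is not the MLlib code or the code from the pull request; the function names (subset_from_index, num_candidate_splits, choose_split_subsets) and the seed parameter are hypothetical. It assumes bit i of an integer selects category i, caps numSplits at (2^(number of categories)) - 2, and samples the integers without replacement so that no subset is considered twice.

```python
import random


def subset_from_index(index, num_categories):
    """Decode an integer into a category subset: bit i set means category i is included."""
    return [c for c in range(num_categories) if (index >> c) & 1]


def num_candidate_splits(num_bins, num_categories):
    """Cap numBins/2 at the number of non-empty, non-full subsets (2^k - 2)."""
    return min(num_bins // 2, 2 ** num_categories - 2)


def choose_split_subsets(num_bins, num_categories, seed=None):
    """Pick distinct random subsets instead of always using the integers 1..numSplits."""
    rng = random.Random(seed)  # seed is a hypothetical parameter, for reproducibility
    num_splits = num_candidate_splits(num_bins, num_categories)
    # Valid integers are 1 .. 2^k - 2; 0 is the empty subset, 2^k - 1 the full one.
    candidates = range(1, 2 ** num_categories - 1)
    # Sampling without replacement guarantees no subset appears more than once.
    indices = rng.sample(candidates, num_splits)
    return [subset_from_index(i, num_categories) for i in indices]


if __name__ == "__main__":
    # 3 categories allow only 6 usable subsets, so 32 bins yield 6 splits, not 16.
    for subset in choose_split_subsets(num_bins=32, num_categories=3, seed=0):
        print(subset)
```

With 3 categories and 32 bins, the sketch returns all 6 usable subsets in random order; with 10 categories and 8 bins it returns 4 subsets drawn from the whole range rather than the integers 1..4.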

This message was sent by Atlassian JIRA
