mahout-user mailing list archives

From Yang Zhou <bingo.y...@gmail.com>
Subject Re: The function of the parameter complemented in DecisionTreeBuilder
Date Fri, 02 Nov 2012 16:25:22 GMT
Hi Anca,

Thanks for replying; that corrects my understanding. The method uses only
the data passed to it to decide whether to split a node. However, I think I
may have found a problem with the code. Please look at lines 277 to 285 of
this file:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/classifier/df/builder/DecisionTreeBuilder.java?av=f

I agree that when complemented = false, on that node we might only
branch on the subset of possible values of C that is contained in the data
passed to the method.

However, when complemented = true, the split is still based on the same
possible values of C from the data that is passed to the method. As the
code from line 278 to line 280 shows, if a value of C is contained in the
entire dataset but not in the data passed to the method, the continue
statement is executed. So the values of C that are not contained in the
data passed to the method do not affect the split.

In short, whether complemented is true or false, the result of
executing the code from line 267 to line 285 is the same.
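
A minimal sketch of the argument above, as a simplified model only. It is not
the Mahout source: the class ComplementedSketch, the method branchedValues and
the int-coded values of C are hypothetical names introduced for illustration,
and allValues is assumed to hold distinct values in ascending order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of the behaviour described above.
// This is NOT Mahout's DecisionTreeBuilder; all names here are made up.
final class ComplementedSketch {

    // Returns the values of C for which a non-empty subset is built at this node.
    // Assumption: allValues contains distinct values in ascending order.
    static List<Integer> branchedValues(int[] nodeValues, int[] allValues, boolean complemented) {
        // Distinct values of C present in the data passed to the method, sorted.
        Set<Integer> present = new TreeSet<>();
        for (int v : nodeValues) {
            present.add(v);
        }

        // Candidates: every value in the full dataset when complemented,
        // otherwise only the values present at this node.
        List<Integer> candidates = new ArrayList<>();
        if (complemented) {
            for (int v : allValues) {
                candidates.add(v);
            }
        } else {
            candidates.addAll(present);
        }

        List<Integer> branched = new ArrayList<>();
        for (int v : candidates) {
            if (!present.contains(v)) {
                continue; // value absent from the node's data: skipped
            }
            branched.add(v);
        }
        return branched;
    }

    public static void main(String[] args) {
        int[] all = {0, 1, 2, 3};
        int[] node = {1, 3, 3};
        System.out.println(branchedValues(node, all, false)); // prints [1, 3]
        System.out.println(branchedValues(node, all, true));  // prints [1, 3]
    }
}
```

In this model the flag changes only which candidates are looped over; the
values that end up with a subset built from the node's data are identical.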

On Fri, Nov 2, 2012 at 10:47 PM, Anca Leuca <ancaleuca2005@gmail.com> wrote:

> Hi Yang,
>
> I think I understand it better now, as well. So this is what I think it
> does:
>
> First of all, I think it only affects categorical node splits. It works
> as follows in this scenario:
> Let us consider a dataset D we want to build a decision tree from.
> Let's say the tree has been partially built, and we've reached a
> categorical attribute C that we want to split on.
>
> As I understand it, when complemented = false, on that node we might only
> branch on a subset of possible values of C.
>
> When complemented = true, however, we will 'force' branching on all
> possible values of C from the entire dataset, and replace the missing data
> with leaves having a label computed from the parent data (line 307):
>
> if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
>   label = sum / data.size();
> } else {
>   label = data.majorityLabel(rng);
> }
>
>
> I hope this is correct and helps with understanding it better.
>
>
> Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>,
> it's the Jira task that introduced the DecisionTreeBuilder, take a
> look at the comments, maybe it'll help you as well.
>
>
>
> Anca
>

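The leaf-label rule in the quoted snippet can be sketched on its own, under
stated assumptions. The snippet does not show how sum is computed; the sketch
assumes it is the sum of the parent's labels. The quoted majorityLabel takes
an rng, and what it is used for is not shown; this sketch breaks ties by
taking the smallest label instead. FallbackLabelSketch and fallbackLabel are
hypothetical names, not part of Mahout.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the quoted rule; NOT Mahout code.
final class FallbackLabelSketch {

    // Label for a leaf that stands in for a value of C missing at this node,
    // computed from the labels of the parent's data.
    static double fallbackLabel(double[] parentLabels, boolean numericalLabel) {
        if (parentLabels.length == 0) {
            throw new IllegalArgumentException("parent data must not be empty");
        }
        if (numericalLabel) {
            // Regression: mean of the parent's labels (assumed meaning of sum / size).
            double sum = 0.0;
            for (double label : parentLabels) {
                sum += label;
            }
            return sum / parentLabels.length;
        }
        // Classification: most frequent label; ties go to the smallest label
        // because TreeMap iterates keys in ascending order and we use strict '>'.
        Map<Double, Integer> counts = new TreeMap<>();
        for (double label : parentLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        double best = parentLabels[0];
        int bestCount = 0;
        for (Map.Entry<Double, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

For example, fallbackLabel(new double[] {0, 1, 1, 2}, false) picks label 1,
the majority among the parent's labels.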