mahout-user mailing list archives

From Yang Zhou <bingo.y...@gmail.com>
Subject Re: The function of the parameter complemented in DecisionTreeBuilder
Date Fri, 02 Nov 2012 17:49:10 GMT
Hi,

Sorry about those confusing words.

I did not mean that there is a bug. What I meant, which is the same as what
you said, is that the value of cnt in line 288 is the same whether
complemented is true or not, and that complemented does affect how the
leaves are built.
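
To make that concrete, here is a minimal sketch of how I now read the
behaviour. It is not Mahout's code: the class ComplementedSketch and the
methods split and majority are hypothetical names, rows are simplified to
{value of C, label} pairs, and only the categorical-label case (majority
vote) is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Mahout.
final class ComplementedSketch {

    // Most frequent label; ties go to the lexicographically smallest label.
    static String majority(List<String> labels) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // rows: {value of C, label} pairs that reached this node.
    // allValues: every value of C seen in the entire dataset.
    // Returns a map from branch value to the label of that branch.
    static Map<String, String> split(List<String[]> rows,
                                     List<String> allValues,
                                     boolean complemented) {
        // Grouping by the values present in the node's data does not
        // depend on 'complemented' at all.
        Map<String, List<String>> subsets = new TreeMap<>();
        List<String> parentLabels = new ArrayList<>();
        for (String[] row : rows) {
            subsets.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
            parentLabels.add(row[1]);
        }

        Map<String, String> children = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : subsets.entrySet()) {
            children.put(e.getKey(), majority(e.getValue()));
        }

        // Only here does 'complemented' matter: values of C missing from
        // the node's data get a leaf labelled from the parent data.
        if (complemented) {
            String parentLabel = majority(parentLabels);
            for (String value : allValues) {
                children.putIfAbsent(value, parentLabel);
            }
        }
        return children;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"red", "yes"},
            new String[] {"red", "yes"},
            new String[] {"blue", "no"});
        List<String> allValues = Arrays.asList("red", "blue", "green");
        System.out.println(split(rows, allValues, false));
        System.out.println(split(rows, allValues, true));
    }
}
```

In this sketch the branches for "red" and "blue" are identical in both
calls; the call with complemented = true only adds a "green" leaf whose
label comes from the parent data.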

Really appreciate your time!

On Sat, Nov 3, 2012 at 1:07 AM, Anca Leuca <ancaleuca2005@gmail.com> wrote:

> Hi,
>
> > However, when complemented = true, the split is still based on the same
> > possible values of C from the data that is passed to the method.
>
>
> Yes. The split is indeed based on a subset of the data.
>
>
> > As the code from line 278 to line 280 shows, if a value of C is contained
> > in the entire dataset, but not in the data that is passed to the method,
> > the continue statement is executed. So those values of C that are not
> > contained in the data passed to the method do not affect the method.
> >
>
> Not sure what you mean by 'affect the method'. I think the datapoints that
> refer to values of C not contained in the data passed are not meant to
> change the calculations.
> Also, *continue* is being called twice: in the loop 277-285 and the loop
> 303-317, under the same conditions. So technically I don't think there's a
> bug there, although admittedly it's not a very clean/obvious solution :).
>
>
> > In a word, whether complemented is true or false, the result after
> > executing the code from line 267 to line 285 is the same.
> >
>
> Again, I am not sure what you mean by 'result'. If you mean the variable
> *subsets*, yes, that one will have the same value, regardless of
> complemented. The interesting stuff, however, happens in lines 302-332,
> where the 'complementing' leaves are being built.
>
> That being said, I think the best approach would be to just give the tree
> builder a test and see what it spits out, for a simple dataset that you can
> eyeball. Or have a look at the unit tests (if any), they should also give a
> clue on what was meant.
>
> Anca
>
>
> > On Fri, Nov 2, 2012 at 10:47 PM, Anca Leuca <ancaleuca2005@gmail.com>
> > wrote:
> >
> > > Hi Yang,
> > >
> > > I think I understand it better now, as well. So this is what I think it
> > > does:
> > >
> > > First of all, I think it only affects the categorical node splits. It
> > > will work as follows in this scenario:
> > > Let us consider a dataset D we want to build a decision tree from.
> > > Let's say the tree has been partially built, and we've reached a
> > > categorical attribute C that we want to split on.
> > >
> > > As I understand it, when complemented = false, on that node we might
> > > only branch on a subset of possible values of C.
> > >
> > > When complemented = true, however, we will 'force' branching on all
> > > possible values of C from the entire dataset, and replace the missing
> > > data with leaves having a label computed from the parent data (line 307):
> > >
> > > if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
> > >   label = sum / data.size();
> > > } else {
> > >   label = data.majorityLabel(rng);
> > > }
> > >
> > >
> > > I hope this is correct and helps with understanding it better.
> > >
> > >
> > > Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>,
> > > it's the Jira task that introduced the DecisionTreeBuilder, take a
> > > look at the comments, maybe it'll help you as well.
> > >
> > >
> > >
> > > Anca
> > >
> >
>
