mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Random Forest possible error
Date Sun, 15 Dec 2013 08:10:12 GMT
Finding problems is never bad, even if misdiagnosed the first time around.


On Sat, Dec 14, 2013 at 4:05 PM, sam wu <swu5530@gmail.com> wrote:

> Hi Ted,
>
> After some more debugging: my previous statement was not correct, please
> disregard it.
> There is a problem, I am sure. I am using the in-memory mapper (InMemMapper),
> one of the ways to load data, and I found the problem there.
> I am going to compare it with the other approaches (partial, Breiman) to
> see what the difference is.
>
> My bad. Well, it's Saturday!
>
> Sam
>
>
> On Sat, Dec 14, 2013 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Can you file a JIRA at https://issues.apache.org/jira/browse/MAHOUT ?
> >
> > It sounds like you have a test case in mind along with your fix.  If you
> > could package that work up as a patch file, then it would be much
> > appreciated.
> >
> >
> > On Sat, Dec 14, 2013 at 9:24 AM, sam wu <swu5530@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am using Mahout's random forest. It works well when the feature
> > > descriptor has no ignored features (no I flag).
> > >
> > > With the I (ignore) flag, the returned feature value is -1
> > > (in the code, dataset.valueOf(aId, token) returns -1).
> > >
> > > I did some investigation and found what looks like a problem in
> > > DataConverter.java.
> > >
> > > Source code:
> > > ------
> > >
> > > for (int attr = 0; attr < nball; attr++) { // line 51
> > >   if (ArrayUtils.contains(dataset.getIgnored(), attr)) {
> > >     continue; // IGNORED
> > >   }
> > >
> > >   String token = tokens[attr].trim();
> > >
> > >   if ("?".equals(token)) {
> > >     // missing value
> > >     return null;
> > >   }
> > >
> > >   if (dataset.isNumerical(aId)) { // line 63
> > >     vector.set(aId++, Double.parseDouble(token));
> > >   } else { // CATEGORICAL
> > >     vector.set(aId, dataset.valueOf(aId, token)); // line 66
> > >     aId++;
> > >   }
> > > }
> > > -------
> > > Let the feature descriptor be "9 I N L" (the Breiman example):
> > > 11 features, where 1-9 are ignored, the 10th is numeric, and the 11th is
> > > the label.
> > > (Does the Breiman example really work when following the web instructions?)
> > >
> > > At line 51, attr is the raw feature index, 0-10.
> > > aId is the index among non-ignored features, 0-1 (two non-ignored features).
> > > The problem is at line 66. For the label feature:
> > >   attr = 10
> > >   aId = 1
> > >   token = true
> > > dataset.valueOf(aId, token) returns -1. In the current code, valueOf()
> > > for CATEGORICAL features seems to mix up the aId and attr index spaces.
> > >
> > > Simply changing line 66 from
> > >   vector.set(aId, dataset.valueOf(aId, token));
> > > to
> > >   vector.set(aId, dataset.valueOf(attr, token));
> > > does not work, because some validation then fails (again an attr/aId mix-up).
> > >
> > >
> > >
> > > I may be overlooking something; these are just some thoughts.
> > >
> > >
> > > Sam
> > >
> >
>
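
The original report reads the descriptor "9 I N L" as 11 attributes: nine ignored, one numeric, one label. A minimal sketch of that expansion follows, assuming a number simply repeats the token after it. The class and method names are hypothetical and this is not Mahout's own descriptor parser.

```java
import java.util.ArrayList;
import java.util.List;

public class DescriptorSketch {
    // Hypothetical helper: expands a shorthand such as "9 I N L" into one
    // token per attribute. A numeric token is taken as a repeat count for
    // the next non-numeric token (an assumption based on the thread).
    static List<String> expand(String descriptor) {
        List<String> out = new ArrayList<>();
        int repeat = 1;
        for (String tok : descriptor.trim().split("\\s+")) {
            if (tok.matches("\\d+")) {
                repeat = Integer.parseInt(tok);
            } else {
                for (int i = 0; i < repeat; i++) {
                    out.add(tok);
                }
                repeat = 1; // the count applies to a single token only
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> attrs = expand("9 I N L");
        // prints: 11 [I, I, I, I, I, I, I, I, I, N, L]
        System.out.println(attrs.size() + " " + attrs);
    }
}
```

With this reading, raw columns 0-8 are ignored, column 9 is numeric, and column 10 is the label, which matches the attr values discussed in the report.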

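The report distinguishes two index spaces: attr (raw column position, ignored columns included) and aId (position among non-ignored columns). The toy model below only illustrates how querying a table with an index from the wrong space can yield -1. It is not Mahout's Dataset class, the table layout is an assumption made for illustration, and the thread itself leaves the actual cause open, since the reporter later withdrew this diagnosis.

```java
import java.util.Arrays;

public class IndexSpacesSketch {
    // Toy stand-in for a categorical lookup; NOT Mahout's Dataset.valueOf.
    // values[i] holds the known categories of attribute i, in whatever
    // index space the table was built with. Returns -1 when nothing matches.
    static int valueOf(String[][] values, int index, String token) {
        String[] known = values[index];
        return known == null ? -1 : Arrays.asList(known).indexOf(token);
    }

    // Maps each raw column index (attr) to its dense index (aId),
    // or -1 when the column is ignored.
    static int[] rawToDense(boolean[] ignored) {
        int[] map = new int[ignored.length];
        int aId = 0;
        for (int attr = 0; attr < ignored.length; attr++) {
            map[attr] = ignored[attr] ? -1 : aId++;
        }
        return map;
    }

    public static void main(String[] args) {
        // Descriptor "9 I N L": columns 0-8 ignored, 9 numeric, 10 label.
        boolean[] ignored = new boolean[11];
        Arrays.fill(ignored, 0, 9, true);
        int[] map = rawToDense(ignored);

        // Hypothetical table keyed by RAW index: only column 10 has categories.
        String[][] byRaw = new String[11][];
        byRaw[10] = new String[] {"false", "true"};

        int attr = 10;
        int aId = map[attr]; // 1
        System.out.println(valueOf(byRaw, aId, "true"));  // prints -1 (wrong index space)
        System.out.println(valueOf(byRaw, attr, "true")); // prints 1 (matching index space)
    }
}
```

Whichever space a lookup table uses, every caller has to use the same one; the validation failure mentioned in the report after swapping aId for attr would be consistent with other code still expecting the dense index.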