mahout-user mailing list archives

From sam wu <swu5...@gmail.com>
Subject Random Forest possible error
Date Sat, 14 Dec 2013 17:24:29 GMT
Hi,

I am using Mahout's random forest. It works well as long as the feature
descriptor contains no ignored features (no I flag).

When the descriptor uses the Ignore flag, the returned feature value is -1
(in the code, dataset.valueOf(aId, token) returns -1).

I did some investigation and found what looks like a problem in
DataConverter.java.

Source code:
------

    for (int attr = 0; attr < nball; attr++) { // line 51
      if (ArrayUtils.contains(dataset.getIgnored(), attr)) {
        continue; // IGNORED
      }

      String token = tokens[attr].trim();

      if ("?".equals(token)) {
        // missing value
        return null;
      }

      if (dataset.isNumerical(aId)) { // line 63
        vector.set(aId++, Double.parseDouble(token));
      } else { // CATEGORICAL
        vector.set(aId, dataset.valueOf(aId, token)); // line 66
        aId++;
      }
    }
-------
Let the feature descriptor be 9 I N L (the Breiman example):
11 features, where features 1-9 are ignored, the 10th is numeric, and the
11th is the label.
(Does the Breiman example really work when following the instructions on the web?)
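
To make the two index spaces concrete, here is a minimal sketch of how a descriptor such as 9 I N L expands to 11 columns and how raw column indices relate to filtered ones. The class DescriptorSketch and its methods expand and filteredIndex are hypothetical names made up for illustration; this is not Mahout's own descriptor parser, only a model of the counting described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Mahout code.
public class DescriptorSketch {

  // Expand a compact descriptor such as "9 I N L" into one flag per column.
  // A number applies to the flag that follows it.
  static List<String> expand(String descriptor) {
    List<String> flags = new ArrayList<>();
    int repeat = 1;
    for (String tok : descriptor.trim().split("\\s+")) {
      if (tok.matches("\\d+")) {
        repeat = Integer.parseInt(tok);
        continue;
      }
      for (int i = 0; i < repeat; i++) {
        flags.add(tok);
      }
      repeat = 1;
    }
    return flags;
  }

  // Map each raw column index (attr) to its filtered index (aId),
  // or -1 when the column is ignored.
  static int[] filteredIndex(List<String> flags) {
    int[] map = new int[flags.size()];
    int aId = 0;
    for (int attr = 0; attr < flags.size(); attr++) {
      map[attr] = "I".equals(flags.get(attr)) ? -1 : aId++;
    }
    return map;
  }

  public static void main(String[] args) {
    List<String> flags = expand("9 I N L");
    int[] map = filteredIndex(flags);
    System.out.println("columns: " + flags.size());
    System.out.println("numeric column: attr=9 -> aId=" + map[9]);
    System.out.println("label column:   attr=10 -> aId=" + map[10]);
  }
}
```

With this descriptor the label sits at raw index 10 but at filtered index 1, which is the pair of values discussed next.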

Line 51: attr is the raw feature index, 0-10.
aId is the filtered feature index, 0-1 (two non-ignored features).
The problem is in line 66:
if attr=10 (the label feature), then
aId=1
token=true
and dataset.valueOf(aId, token) returns -1. In the current code, the
valueOf() call for CATEGORICAL features seems to mix up the aId and attr
concepts.
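
The symptom can be reproduced in isolation with a small sketch. It assumes, purely for illustration, that the category tables are stored per raw column index and that the lookup is an indexOf-style search returning -1 when nothing matches; whether Mahout's Dataset is actually laid out this way is exactly what is in question here. The class IndexMixupSketch and the table VALUES_BY_RAW_INDEX are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Mahout code.
public class IndexMixupSketch {

  // Assumed layout: one category table per RAW column index.
  // Columns 0-8 are ignored and column 9 is numeric, so they have no table.
  static final String[][] VALUES_BY_RAW_INDEX = new String[11][];

  static {
    VALUES_BY_RAW_INDEX[10] = new String[] {"false", "true"};
  }

  // indexOf-style lookup: returns -1 when the column has no table
  // or the token is not one of its categories.
  static int valueOf(int index, String token) {
    String[] categories = VALUES_BY_RAW_INDEX[index];
    if (categories == null) {
      return -1;
    }
    return Arrays.asList(categories).indexOf(token);
  }

  public static void main(String[] args) {
    int attr = 10; // raw index of the label column
    int aId = 1;   // filtered index of the same column
    System.out.println("lookup by aId:  " + valueOf(aId, "true"));  // -1
    System.out.println("lookup by attr: " + valueOf(attr, "true")); // 1
  }
}
```

Under that assumption, any lookup keyed by the filtered index lands on the wrong column as soon as an ignored column precedes the categorical one.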

Just changing line 66 from
vector.set(aId, dataset.valueOf(aId, token));
to
vector.set(aId, dataset.valueOf(attr, token));
does not work, because some validation then fails (also due to the
attr/aId mix-up).



There might be things that I have overlooked; these are just some thoughts.


Sam
