Yes I looked at the impl here, and I think it is aging, since I'm not
sure Deneche had time to put in many bells or whistles at the start,
and not sure it's been touched much since.
My limited experience is that it generally does less clever stuff than
R, which in turn is less clever than sklearn et al. hence the gap in
results. There are lots of ways you can do better than the original
Breiman paper, which is what R sticks to mostly.
Weirdly I was just having a long conversation about this exact topic
today, since I'm working on an RDF implementation on Hadoop. (I think
it might be worth a new implementation after this much time, if one
were looking to revamp RDF on Hadoop and inject some new tricks. It
needs some different design choices.)
Anyway, the question was indeed which splits of an Nvalued
categorical (nominal) variable to consider? because considering all
2^N2 of them is not scalable, especially since I don't want any limit
on N.
There are easy, fast ways to figure out what splits to consider for
every other combination of categorical/numeric feature F predicting
categorical/numeric target T, but I couldn't find any magic for one:
categorical F predicting categorical T.
I ended up making up a heuristic that is at least linear in N, and I
wonder if anyone is a) interested in talking about this at all or b)
has the magic answer here.
So  sort the values of the F by the entropy of T considered over
examples for just that value of F. Then consider splits based on
prefixes of that list. So if F = [a, b, c, d] and in order by entropy
of T they are [b, c, a, d] then consider rules like F in {b}, F in
{b,c}, F in {b,c,a}.
This isn't a great heuristic but seems to work well in practice.
I suppose it's this and a lot of other little tricks like that that
could improve this or any other implementation  RDF makes speed and
accuracy pretty tradeoffable, so anything that makes things faster
can make it instead more accurate or vice versa.
Definitely an interesting topic I'd be interested to cover with anyone
building RDFs now.
On Fri, Oct 18, 2013 at 7:26 PM, DeBarr, Dave <debarr@mitre.org> wrote:
> Another difference...
>
> R's randomForest package (which RRF is based on) evaluates subsets of values when partitioning
nominal values. [This is why it complains if there are more than 32 distinct values for a
nominal variable.]
>
> For example, if our nominal variable has values { A, B, C, D }, the package will consider
"in { A, C }" versus "not in { A, C }" as a partition candidate.
>
> Original Message
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Friday, October 18, 2013 10:42 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout 0.8 Random Forest Accuracy
>
> On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut <tim@timpeut.com> wrote:
>
>> Has anyone found that Mahout's random forest doesn't perform as well as
>> other implementations? If not, is there any reason why it wouldn't perform
>> as well?
>>
>
> This is disappointing, but not entirely surprising. There has been
> considerably less effort applied to Mahouts random forest package than the
> comparable R packages.
>
> Note, particularly that the Mahout implementation is not regularized. That
> could well be a big difference.
