spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Garbage stats in Random Forest leaf node?
Date Tue, 17 Mar 2015 20:05:35 GMT
There are two cases in which a candidate split is marked invalid: it does
not satisfy minInstancesPerNode, or it does not satisfy minInfoGain:

https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L729
https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L745
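The two checks can be sketched roughly as follows. This is a minimal illustration, not the MLlib code behind the links above: `GainStats`, `SplitCheck`, and `evaluate` are hypothetical names, and the sentinel only mirrors the values reported in this thread (gain = Double.MinValue, impurity = -1).

```scala
// Hypothetical stand-in for the gain statistics; not the MLlib class.
final case class GainStats(gain: Double, impurity: Double)

object SplitCheck {
  // Sentinel mirroring the values seen in this thread.
  val Invalid: GainStats = GainStats(Double.MinValue, -1.0)

  def evaluate(
      leftCount: Long,
      rightCount: Long,
      parentImpurity: Double,
      leftImpurity: Double,
      rightImpurity: Double,
      minInstancesPerNode: Int,
      minInfoGain: Double): GainStats = {
    // Case 1: one of the children would receive too few instances.
    if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) {
      return Invalid
    }
    val total = (leftCount + rightCount).toDouble
    // Gain = parent impurity minus the weighted impurity of the children.
    val gain = parentImpurity -
      (leftCount / total) * leftImpurity -
      (rightCount / total) * rightImpurity
    // Case 2: the split does not improve impurity enough.
    if (gain < minInfoGain) Invalid else GainStats(gain, parentImpurity)
  }
}
```

Under these assumptions, a node whose best split fails either check ends up holding the sentinel rather than real statistics.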

On Tue, Mar 17, 2015 at 12:59 PM, Chang-Jia Wang <cj@cjwang.us> wrote:
> Just curious, why do most of the leaf nodes return None, while just a couple return the default?
> Why would the gain be invalid?
>
> C.J.
>
> On Mar 17, 2015, at 11:53 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>
>> This is the default value (Double.MinValue) for invalid gain:
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67
>>
>> Please ignore it. Maybe we should update `toString` to use scientific notation.
>>
>> -Xiangrui
>>
>>
>> On Mon, Mar 16, 2015 at 5:19 PM, cjwang <cj@cjwang.us> wrote:
>>> I dumped the trees in the random forest model, and occasionally saw a leaf
>>> node with strange stats:
>>>
>>> - pred=1.000000 prob=0.800000 imp=-1.000000
>>> gain=-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000
>>>
>>>
>>> Here impurity = -1 and gain = a giant negative number.  Normally, I would
>>> get a None from Node.stats at a leaf node.  Here it printed because Some(s)
>>> matches:
>>>
>>>            node.stats match {
>>>                case Some(s) => println(" imp=%f gain=%f".format(s.impurity, s.gain))
>>>                case None => println()
>>>            }
>>>
>>>
>>> Is it a bug?
>>>
>>> This doesn't seem to happen in the model from DecisionTree, but my data sets
>>> are limited.
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Garbage-stats-in-Random-Forest-leaf-node-tp22087.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>
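
For anyone dumping trees the same way, the sentinel gain discussed in the quoted messages can be skipped when printing, and real gains shown in scientific notation. The sketch below is only an illustration: `StatsPrinter` and `describe` are hypothetical names, and the stats are modelled as an optional (impurity, gain) pair instead of the MLlib type.

```scala
import java.util.Locale

object StatsPrinter {
  // Hypothetical helper: renders optional (impurity, gain) stats for a node.
  def describe(stats: Option[(Double, Double)]): String = stats match {
    // Sentinel gain (Double.MinValue): no valid split was found for this node.
    case Some((_, gain)) if gain == Double.MinValue => " (no valid split)"
    // Real stats: %e keeps very large or very small gains readable.
    case Some((impurity, gain)) =>
      " imp=%f gain=%e".formatLocal(Locale.ROOT, impurity, gain)
    // No stats recorded at all.
    case None => ""
  }
}
```

Printing `StatsPrinter.describe(...)` for each node in place of the raw `%f` output avoids the 300-digit line shown in the original report.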
