spark-issues mailing list archives

From "Qiping Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree
Date Fri, 29 Aug 2014 10:29:52 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115088#comment-14115088 ]

Qiping Li commented on SPARK-3272:
----------------------------------

Hi Joseph, sorry for the late reply. I still think we should store the number of instances for
the left and right children, because whether a node is a leaf is determined by whether
the best split sends enough instances to both the left and the right child.
Even if a node has enough instances, it should still be a leaf if the best split doesn't
satisfy the min instances requirement.
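
A minimal sketch of this rule, written in Python for brevity rather than in MLlib's Scala. The names `SplitCounts`, `split_is_valid`, and `node_is_leaf` are hypothetical and are not part of the MLlib API; the sketch only assumes that the per-child instance counts of the best split are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitCounts:
    """Instances the best split sends to each child (hypothetical type)."""
    left: int
    right: int


def split_is_valid(counts: SplitCounts, min_instances_per_node: int) -> bool:
    # Both children must receive at least the minimum number of instances.
    return (counts.left >= min_instances_per_node
            and counts.right >= min_instances_per_node)


def node_is_leaf(best_split: SplitCounts, min_instances_per_node: int) -> bool:
    # The node's own size is not enough to decide: a node with many
    # instances is still a leaf if its best split starves one child.
    return not split_is_valid(best_split, min_instances_per_node)
```

For example, a node with 100 instances whose best split sends 98 left and 2 right is a leaf when the minimum is 5, which is why the child counts need to be stored.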

As for the invalid information gain value, it is just a constant that denotes a split that makes
no sense because it doesn't satisfy the min info gain or min instances per node requirement.
I think there should be a specific value for this, because an invalid split should
be marked as invalid so that the main loop knows not to pick it, even though we can still
calculate the info gain for it.
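
A rough sketch of the idea, in Python rather than MLlib's Scala. The sentinel `INVALID_GAIN`, the use of Gini impurity, and the function names are all assumptions made for illustration; they are not taken from the MLlib code.

```python
# Hypothetical sentinel: any value that a real gain can never take would do.
INVALID_GAIN = float("-inf")


def gini(counts):
    """Gini impurity of a list of per-class instance counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gain_or_invalid(parent, left, right, min_info_gain, min_instances_per_node):
    """Return the info gain of a split, or INVALID_GAIN if it breaks a constraint."""
    n_left, n_right = sum(left), sum(right)
    if n_left < min_instances_per_node or n_right < min_instances_per_node:
        return INVALID_GAIN
    n = n_left + n_right
    gain = gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)
    if gain < min_info_gain:
        return INVALID_GAIN
    return gain


def pick_best_split(gains):
    """Main-loop selection: index of the best valid split, or None if none is valid."""
    best = None
    for i, g in enumerate(gains):
        if g == INVALID_GAIN:
            continue  # never pick a split that was marked invalid
        if best is None or g > gains[best]:
            best = i
    return best
```

When `pick_best_split` returns `None`, every candidate was marked invalid and the node would become a leaf.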

> Calculate prediction for nodes separately from calculating information gain for splits in decision tree
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3272
>                 URL: https://issues.apache.org/jira/browse/SPARK-3272
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Qiping Li
>             Fix For: 1.1.0
>
>
> In the current implementation, the prediction for a node is calculated along with the information gain stats for each possible split. The value to predict for a specific node is determined regardless of what the splits are.
> To save computation, we can calculate the prediction first and then calculate the information gain stats for each split.
> This is also necessary if we want to support the minimum instances per node parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), because when no split satisfies the minimum instances requirement, we don't use the information gain of any split. There should be a way to get the prediction value.
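
The proposal in the description can be sketched as follows, in Python rather than MLlib's Scala. The functions `predict_from_counts` and `evaluate_node` are hypothetical names, and the sketch assumes a classification node summarized by per-class instance counts; it is not the MLlib implementation.

```python
def predict_from_counts(label_counts):
    """Return (predicted_label, probability) from a node's per-class counts."""
    total = sum(label_counts)
    label = max(range(len(label_counts)), key=lambda k: label_counts[k])
    prob = label_counts[label] / total if total else 0.0
    return label, prob


def evaluate_node(label_counts, candidate_splits, min_instances_per_node):
    """Compute the prediction once, then check the candidate splits.

    candidate_splits is a list of (left_counts, right_counts) pairs.
    """
    # The prediction depends only on the node's own instances, so it is
    # computed a single time, before any split is examined.
    prediction = predict_from_counts(label_counts)

    valid = [
        (left, right)
        for left, right in candidate_splits
        if sum(left) >= min_instances_per_node
        and sum(right) >= min_instances_per_node
    ]
    # Even when no split is valid, the prediction is still available.
    is_leaf = not valid
    return prediction, is_leaf
```

With `evaluate_node([8, 2], [([1, 0], [7, 2])], 3)` the only candidate split is rejected, yet the node still gets a prediction of class 0 with probability 0.8.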



--
This message was sent by Atlassian JIRA
(v6.2#6252)


