spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <>
Subject [jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training
Date Sat, 18 Oct 2014 00:37:33 GMT


Xiangrui Meng updated SPARK-3161:
    Assignee: Sung Chung

> Cache example-node map for DecisionTree training
> ------------------------------------------------
>                 Key: SPARK-3161
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Sung Chung
> Improvement: worker computation
> When training each level of a DecisionTree, each example needs to be mapped to a node
> in the current level (or to none if it does not reach that level). This is currently done
> via the function predictNodeIndex(), which traces from the current tree’s root node to the
> given level.
> Proposal: Cache this mapping.
> * Pro: O(1) lookup instead of O(level).
> * Con: Extra RDD which must share the same partitioning as the training data.
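The trade-off above can be sketched in a few lines. This is a hedged illustration, not Spark's actual internals: `Node`, `traceToNode`, and `nodeIndexCache` are hypothetical names standing in for `predictNodeIndex()` and the proposed cache.

```scala
object NodeIndexCacheSketch {
  // A minimal binary decision tree: leaves carry only an id,
  // internal nodes split on (feature, threshold).
  sealed trait Node { def id: Int }
  final case class Leaf(id: Int) extends Node
  final case class Internal(id: Int, feature: Int, threshold: Double,
                            left: Node, right: Node) extends Node

  // What predictNodeIndex() does today: an O(level) walk from the root.
  def traceToNode(node: Node, features: Array[Double]): Int = node match {
    case Leaf(id) => id
    case Internal(_, f, t, l, r) =>
      traceToNode(if (features(f) <= t) l else r, features)
  }

  // Depth-1 example tree: node 0 splits on feature 0 at 0.5 into leaves 1 and 2.
  val sampleTree: Node = Internal(0, 0, 0.5, Leaf(1), Leaf(2))
  val examples: Array[Array[Double]] = Array(Array(0.2), Array(0.9))

  // The proposed cache: one node index per example, refreshed when a level is
  // split, so each later lookup is an O(1) array access instead of a re-trace.
  val nodeIndexCache: Array[Int] = examples.map(traceToNode(sampleTree, _))
}
```

The cost named in the con above is visible here: `nodeIndexCache` is per-example state, so as an RDD it must be co-partitioned with the training data and kept in sync as splits are chosen.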
> Design:
> * (option 1) This could be done as in [Sequoia Forests |]
> where each instance is stored with an array of node indices (1 node per tree).
> * (option 2) This could also be done by storing an RDD[Array[Map[Int, Array[TreePoint]]]],
> where each partition stores an array of maps from node indices to an array of instances.
> This has more overhead in data structures but could be more efficient: not all nodes are split
> on each iteration, and this would allow each executor to ignore instances which are not used
> for the current node set.
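Option 2's per-partition structure can be sketched as follows. This is an assumption-laden illustration: `TreePoint` here is a stand-in for Spark's internal binned representation, and `instancesForNodes` is a hypothetical helper, not an existing MLlib function.

```scala
object NodeToInstancesSketch {
  // Stand-in for Spark's internal TreePoint (label + binned feature values).
  final case class TreePoint(label: Double, binnedFeatures: Array[Int])

  // Option 2's per-partition structure: one map per tree, sending a node
  // index to the instances currently sitting at that node.
  type PartitionIndex = Array[Map[Int, Array[TreePoint]]]

  // The efficiency argument above: an executor touches only the instances
  // mapped to nodes being split this iteration and skips everything else.
  def instancesForNodes(part: PartitionIndex, treeIdx: Int,
                        nodesToSplit: Set[Int]): Seq[TreePoint] =
    part(treeIdx).collect {
      case (nodeId, pts) if nodesToSplit(nodeId) => pts.toSeq
    }.toSeq.flatten
}
```

Compared with option 1's flat per-instance array, this keyed layout pays for the extra maps but lets a worker answer "which instances reach node n?" directly, without scanning instances whose nodes are not in the current split set.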

This message was sent by Atlassian JIRA

