spark-user mailing list archives

From Josh Buffum <>
Subject train many decision trees with a single spark job
Date Sun, 11 Jan 2015 01:53:20 GMT
I've got a data set of activity by user. For each user, I'd like to train a
decision tree model. I currently have the feature creation step implemented
in Spark and would naturally like to use MLlib's decision tree model.
However, it looks like the decision tree model expects the whole RDD and
will train a single tree.

Can I split the RDD by user (i.e. groupByKey) and then call
DecisionTree.trainClassifier in a reduce() or aggregate function to create an
RDD[DecisionTreeModel]? Or could I train each model on an in-memory dataset
instead of an RDD? Or call sc.parallelize on the Iterable values from a
groupBy to create a mini-RDD per user?
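
The second idea (group by user, then train on each user's rows in memory) can be sketched without Spark at all. The block below is only an illustration in plain Python: the sample records, the train_stump helper, and the one-split "stump" standing in for a real decision tree are all hypothetical and are not MLlib APIs. The assumption is that a function of this shape, taking one user's rows and returning a model, is the kind of thing that could be applied to each group after a groupByKey.

```python
from collections import defaultdict


def train_stump(rows):
    """Fit a one-split decision stump on in-memory rows of (label, features).

    Hypothetical stand-in for a real per-user tree learner.
    """
    best = None  # (errors, feature_index, threshold, left_label, right_label)
    n_features = len(rows[0][1])
    for j in range(n_features):
        # Try every observed value of feature j as a split threshold.
        for threshold in sorted({f[j] for _, f in rows}):
            left = [y for y, f in rows if f[j] <= threshold]
            right = [y for y, f in rows if f[j] > threshold]
            # Predict the majority label on each side of the split.
            left_label = max(set(left), key=left.count)
            right_label = max(set(right), key=right.count) if right else left_label
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    _, j, threshold, left_label, right_label = best
    return lambda features: left_label if features[j] <= threshold else right_label


# Made-up sample data: (user, label, features).
records = [
    ("alice", 0, [1.0]), ("alice", 0, [2.0]),
    ("alice", 1, [8.0]), ("alice", 1, [9.0]),
    ("bob", 1, [1.0]), ("bob", 1, [2.0]),
    ("bob", 0, [8.0]), ("bob", 0, [9.0]),
]

# Plain-Python analogue of grouping by key: collect each user's rows.
by_user = defaultdict(list)
for user, label, features in records:
    by_user[user].append((label, features))

# One independently trained model per user.
models = {user: train_stump(rows) for user, rows in by_user.items()}
```

This only works if each user's rows fit comfortably in memory on a single worker, which is the main thing to check before trying the same shape on a real cluster.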

Has anyone else tried something like this with success?

