spark-user mailing list archives

From Sean Owen <>
Subject Re: train many decision trees with a single spark job
Date Sun, 11 Jan 2015 12:12:27 GMT
You mean you want to divide the data set into N subsets, splitting
by user, rather than build one model per user, right?

I suppose you could filter the source RDD N times, and build a model
for each resulting subset. This can be parallelized on the driver. For
example let's say you divide into N subsets depending on the value of
the user ID modulo N:

val N = ...
(0 until N).par.map(d => DecisionTree.train(data.filter(_.userID % N
== d), ...))

data should be cache()-ed here of course.
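
To make the filter-based idea concrete, here is a minimal sketch under assumptions that are not in the thread: records are taken to be (userId, LabeledPoint) pairs with non-negative IDs, the names BucketedTrees and trainPerBucket are made up for illustration, and the tree settings (2 classes, gini, depth 5, 32 bins) are placeholders. It uses Scala Futures rather than a parallel collection to submit the N jobs concurrently from the driver, so it does not depend on which Scala version ships parallel collections.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object BucketedTrees {
  // Hypothetical helper: trains one classifier per bucket of userId % n.
  // Assumes non-negative user IDs and binary labels (0.0 / 1.0).
  def trainPerBucket(data: RDD[(Long, LabeledPoint)], n: Int): Seq[DecisionTreeModel] = {
    data.cache() // every filter below rescans the same RDD

    // Each Future submits an independent Spark job from the driver.
    val jobs = (0 until n).map { d =>
      Future {
        val subset = data
          .filter { case (userId, _) => userId % n == d }
          .map { case (_, point) => point }
        // Placeholder settings: 2 classes, no categorical features,
        // gini impurity, maxDepth 5, maxBins 32.
        DecisionTree.trainClassifier(subset, 2, Map[Int, Int](), "gini", 5, 32)
      }
    }

    // Block until all n models are trained; results keep bucket order.
    Await.result(Future.sequence(jobs), Duration.Inf)
  }
}
```

Calling BucketedTrees.trainPerBucket(data, 4) would return four models, where the model at index d was trained on the users with userId % 4 == d. Every bucket needs at least some rows, or training on that subset will fail.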

However it may be faster and more principled to take random subsets directly:

data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
DecisionTree.train(subset, ...))
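
A hedged sketch of the random-split variant follows. The names RandomSplitTrees and trainOnRandomSplits, the explicit seed, and the tree settings are assumptions for illustration only; the training loop is kept sequential here for brevity.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

object RandomSplitTrees {
  // Hypothetical helper: trains one classifier per random subset.
  // Assumes binary labels (0.0 / 1.0).
  def trainOnRandomSplits(data: RDD[LabeledPoint], n: Int, seed: Long): Seq[DecisionTreeModel] = {
    data.cache() // each split is computed from the same parent RDD

    // n equal weights; randomSplit normalizes them and samples per row,
    // so the subsets are only approximately equal in size.
    val subsets: Array[RDD[LabeledPoint]] =
      data.randomSplit(Array.fill(n)(1.0 / n), seed)

    // Placeholder settings: 2 classes, no categorical features,
    // gini impurity, maxDepth 5, maxBins 32.
    subsets.toSeq.map { subset =>
      DecisionTree.trainClassifier(subset, 2, Map[Int, Int](), "gini", 5, 32)
    }
  }
}
```

Note that a random split spreads each user's rows across several subsets, whereas the modulo filter keeps all of a user's rows in one subset; which behavior is wanted depends on the use case.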

On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <> wrote:
> I've got a data set of activity by user. For each user, I'd like to train a
> decision tree model. I currently have the feature creation step implemented
> in Spark and would naturally like to use mllib's decision tree model.
> However, it looks like the decision tree model expects the whole RDD and
> will train a single tree.
> Can I split the RDD by user (i.e. groupByKey) and then call
> DecisionTree.trainClassifier in a reduce() or aggregate function to create an
> RDD[DecisionTreeModel]? Maybe train the model with an in-memory dataset
> instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
> to create a mini-RDD?
> Has anyone else tried something like this with success?
> Thanks!
