spark-user mailing list archives

From Maximo Gurmendez <>
Subject Partitioning an RDD for training multiple classifiers
Date Tue, 08 Sep 2015 14:47:01 GMT
    I have an RDD that needs to be split (say, by client) in order to train n models, i.e.
one for each client. Since most of the classifiers that come with MLlib can only accept
a single RDD as input (and, as I understand it, cannot build multiple models in one pass), the only
way to train n separate models is to create n RDDs by filtering the original one.


rdd1, rdd2, rdd3 = splitRdds(bigRdd)

The function splitRdds would use the standard filter mechanism. I would then need to submit
n training Spark jobs. When I do this, will each job traverse bigRdd again, i.e. n times in total?
Is there a better way to persist the split RDDs (i.e. keep each split in the cache)?

I could cache bigRdd, but I'm not sure that would be very efficient either, since it would still
require the same number of passes (I think, but I'm relatively new to Spark). Also, I'm
planning on reusing the individual splits (rdd1, rdd2, etc.), so it would be convenient to have
them individually cached.

Another problem is that the splits could be very skewed (i.e. one split could represent
a large percentage of the original bigRdd). So saving the split RDDs to disk, at least naively,
could be a challenge.

Is there any better way of doing this?

