spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Tucker <>
Subject Re: Partitioning a RDD for training multiple classifiers
Date Tue, 08 Sep 2015 15:32:56 GMT
Hi Maximo —

This is a relatively naive answer, but I would consider structuring the RDD
into a DataFrame, then saving the 'splits' using something like
DataFrame.write.parquet(hdfs_path, byPartition=('client')). You could then
read a DataFrame from each resulting parquet directory and do your
per-client work from these. You mention re-using the splits, so this
solution might be worth the file-writing time.

Does anyone know of a method that gets a collection of DataFrames — one for
each partition, in the byPartition=('client') sense — from a 'big'
DataFrame? Basically, the equivalent of writing by partition and creating a
DataFrame for each result, but skipping the HDFS step.

On Tue, Sep 8, 2015 at 10:47 AM, Maximo Gurmendez <>

> Hi,
>     I have a RDD that needs to be split (say, by client) in order to train
> n models (i.e. one for each client). Since most of the classifiers that
> come with ml-lib only can accept an RDD as input (and cannot build multiple
> models in one pass - as I understand it), the only way to train n separate
> models is to create n RDDs (by filtering the original one).
> Conceptually:
> rdd1,rdd2,rdd3 = splitRdds(bigRdd)
> the function splitRdd would use the standard filter mechanism .  I would
> then need to submit n training spark jobs. When I do this, will it mean
> that it will traverse the bigRdd n times? Is there a better way to persist
> the splitted rdd (i.e. save the split RDD in a cache)?
> I could cache the bigRdd, but not sure that would be ver efficient either
> since it will require the same number of passes anyway (I think - but I’m
> relatively new to Spark). Also I’m planning on reusing the individual
> splits (rdd1, rdd2, etc so would be convenient to have them individually
> cached).
> Another problem is that the splits are could be very skewed (i.e. one
> split could represent a large percentage of the original bigRdd ). So
> saving the split RDDs to disk (at least, naively) could be a challenge.
> Is there any better way of doing this?
> Thanks!
>    Máximo

View raw message