spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maximo Gurmendez <>
Subject Re: Partitioning a RDD for training multiple classifiers
Date Wed, 09 Sep 2015 19:51:11 GMT
Adding an example (very raw), to see if my understanding is correct:

 val repartitioned = bidRdd.partitionBy(new Partitioner {
    def numPartitions: Int = 100

    def getPartition(clientId: Any): Int = hash(clientId) % 100

val cachedRdd =  repartitioned.cache()

val client1Rdd = cachedRdd.filter({case (clientId:String,v:String) => clientId.equals(“Client

val client2Rdd = cachedRdd.filter({case (clientId:String,v:String) => clientId.equals(“Client


When the first MyModel.train() runs it will trigger an action and will cause the repartition
of the original bigRdd and its caching.

Can someone confirm if the following statements are true?

1) When I execute an action on client2Rdd, it will only read from the partition that corresponds
to Client 2 (it won’t iterate over ALL items originally in bigRdd)
2) The caching happens in a way that preserves the partitioning by client Id (and the locality)


PD: I am aware that this might cause imbalance of data, but I can probably mitigate that with
a smarter partitioner.

On Sep 9, 2015, at 9:30 AM, Maximo Gurmendez <<>>

Thanks Ben for your answer. I’ll explore what happens under the hoods in a  data frame.

Regarding the ability to split a large RDD into n RDDs without requiring n passes to the large
RDD.  Can partitionBy() help? If I partition by a key that corresponds to the the split criteria
(i..e client id) and then cache each of those RDDs. Will that lessen the effect of repeated
large traversals (since Spark will figure out that for each smaller RDD it just needs to traverse
a subset of the partitions)?


On Sep 8, 2015, at 11:32 AM, Ben Tucker <<>>

Hi Maximo —

This is a relatively naive answer, but I would consider structuring the RDD into a DataFrame,
then saving the 'splits' using something like DataFrame.write.parquet(hdfs_path, byPartition=('client')).
You could then read a DataFrame from each resulting parquet directory and do your per-client
work from these. You mention re-using the splits, so this solution might be worth the file-writing

Does anyone know of a method that gets a collection of DataFrames — one for each partition,
in the byPartition=('client') sense — from a 'big' DataFrame? Basically, the equivalent
of writing by partition and creating a DataFrame for each result, but skipping the HDFS step.

On Tue, Sep 8, 2015 at 10:47 AM, Maximo Gurmendez <<>>
    I have a RDD that needs to be split (say, by client) in order to train n models (i.e.
one for each client). Since most of the classifiers that come with ml-lib only can accept
an RDD as input (and cannot build multiple models in one pass - as I understand it), the only
way to train n separate models is to create n RDDs (by filtering the original one).


rdd1,rdd2,rdd3 = splitRdds(bigRdd)

the function splitRdd would use the standard filter mechanism .  I would then need to submit
n training spark jobs. When I do this, will it mean that it will traverse the bigRdd n times?
Is there a better way to persist the splitted rdd (i.e. save the split RDD in a cache)?

I could cache the bigRdd, but not sure that would be ver efficient either since it will require
the same number of passes anyway (I think - but I’m relatively new to Spark). Also I’m
planning on reusing the individual splits (rdd1, rdd2, etc so would be convenient to have
them individually cached).

Another problem is that the splits are could be very skewed (i.e. one split could represent
a large percentage of the original bigRdd ). So saving the split RDDs to disk (at least, naively)
could be a challenge.

Is there any better way of doing this?


View raw message