spark-user mailing list archives

From shlomi <>
Subject Re: Splitting RDD by partition
Date Fri, 20 May 2016 08:59:28 GMT
Another approach I found:

First, I make a PartitionsRDD class which only takes a certain range of
partitions from the parent RDD:
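import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Wraps a parent partition and gives it a new, zero-based index.
case class PartitionsRDDPartition(index: Int, origSplit: Partition)
  extends Partition

class PartitionsRDD[U: ClassTag](var prev: RDD[U], drop: Int, take: Int)
  extends RDD[U](prev) {
  // Keep only parent partitions [drop, drop + take) and re-index them from 0.
  override def getPartitions: Array[Partition] =
    prev.partitions.drop(drop).take(take).zipWithIndex.map {
      case (split, idx) => new PartitionsRDDPartition(idx, split): Partition
    }
  // Delegate to the parent RDD, using the original (un-shifted) partition.
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    prev.iterator(split.asInstanceOf[PartitionsRDDPartition].origSplit, context)
}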

And then I can create my two RDDs using the following:
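def splitByPartition[T: ClassTag](rdd: RDD[T], hotPartitions: Int): (RDD[T],
RDD[T]) = {
  // The first hotPartitions partitions go left, all remaining ones go right.
  val left  = new PartitionsRDD[T](rdd, 0, hotPartitions)
  val right = new PartitionsRDD[T](rdd, hotPartitions,
    rdd.partitions.length - hotPartitions)
  (left, right)
}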
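
For comparison, Spark also ships a developer API, PartitionPruningRDD, that
keeps only the parent partitions whose index passes a filter. The sketch below
is not the custom class above but an alternative way to get the same two-way
split. The names SplitByPartitionSketch and splitWithPruning, and the local
master with 10 partitions, are made up for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

object SplitByPartitionSketch {
  // Hypothetical helper: same contract as splitByPartition, but built on
  // Spark's PartitionPruningRDD instead of a hand-written RDD subclass.
  def splitWithPruning[T](rdd: RDD[T], hotPartitions: Int): (RDD[T], RDD[T]) = {
    // Keep parent partitions with index < hotPartitions.
    val left: RDD[T] = PartitionPruningRDD.create(rdd, (i: Int) => i < hotPartitions)
    // Keep the remaining parent partitions.
    val right: RDD[T] = PartitionPruningRDD.create(rdd, (i: Int) => i >= hotPartitions)
    (left, right)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup for a quick check; not a cluster configuration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("split-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100, 10) // 10 partitions
      val (hot, cold) = splitWithPruning(rdd, 3)
      // Partition counts of the two halves: 3 and 7.
      println(s"hot=${hot.partitions.length}, cold=${cold.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

Neither half triggers a job by itself; both are narrow views over the parent,
so nothing is computed until an action runs on one of them.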

This approach saves a few minutes compared to the one in the previous
post (at least in a local test; I still need to test this on a real
cluster).

Any thoughts on this? Is this the right thing to do, or am I missing
something important?
