With that pull request, DAGScheduler's getPreferredLocs can now be called from external code (although it is still subject to a package-level visibility constraint).

The parallel collection RDD doesn't let you specify locality constraints, but you can easily implement a new RDD that does.

Some semi-pseudocode:


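// Wraps an existing RDD; locs(i) is the preferred host for partition i of prev.
class LocalityConstraintRDD[T: ClassManifest](prev: RDD[T], locs: Array[String])
  extends RDD[T](prev) {

   // Reuse the parent's partitions unchanged.
   override def getPartitions: Array[Partition] = prev.partitions

   // Delegate to the parent (this reads from the cache if prev is cached).
   override def compute(split: Partition, context: TaskContext): Iterator[T] =
     prev.iterator(split, context)

   override def getPreferredLocations(split: Partition): Seq[String] = {
     List(locs(split.index))
   }
}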
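
For the size-balancing part you describe, here is a rough sketch in plain Scala (no Spark dependency) of one way to do it, assuming you already know each cached partition's host and size. It groups by host first, so no group spans machines, and then packs each host's partitions into bins largest-first, always into the lightest bin. That is a simple greedy heuristic I am substituting for illustration, not what CoalescedRDD does. PartInfo, SizeBalancedGrouping, balance and group are all hypothetical names, not Spark API.

```scala
// Hypothetical record (not part of Spark): one cached parent partition.
case class PartInfo(index: Int, host: String, size: Long)

object SizeBalancedGrouping {

  // Greedily packs partitions into at most numGroups bins of similar total size.
  def balance(parts: Seq[PartInfo], numGroups: Int): Seq[Seq[PartInfo]] = {
    require(numGroups > 0, "numGroups must be positive")
    val bins = Array.fill(numGroups)(Vector.empty[PartInfo])
    val totals = Array.fill(numGroups)(0L)
    // Largest first; each partition goes into the currently lightest bin.
    for (p <- parts.sortBy(-_.size)) {
      val i = totals.indexOf(totals.min)
      bins(i) = bins(i) :+ p
      totals(i) += p.size
    }
    // Drop bins that received nothing (fewer partitions than bins).
    bins.toSeq.filter(_.nonEmpty)
  }

  // Groups by host first so that no group spans machines,
  // then balances by size within each host.
  def group(parts: Seq[PartInfo], groupsPerHost: Int): Map[String, Seq[Seq[PartInfo]]] =
    parts.groupBy(_.host).map { case (host, ps) => host -> balance(ps, groupsPerHost) }
}
```

Each resulting group would become one coalesced partition, and the group's host is what you would report as that partition's preferred location.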






--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Mon, Sep 2, 2013 at 4:10 PM, Wenlei Xie <wenlei.xie@gmail.com> wrote:
Thank you! It's a very nice improvement :).

However, my situation is a bit different -- the code there tries to give each coalesced partition roughly the same *number of parent partitions*, while in my case the parent partitions could be quite imbalanced, and I am trying to make each coalesced partition roughly the same *SIZE*.

Of course, this requires the sizes of the parent partitions to be known -- which is not a problem in my case, as I would always generate the parent RDD and cache it. This is probably not a common case, so I am happy to write my own (hacky) code to get around it -- but I need the location of each cached partition...

By the way: is it possible to assign preferred locations to a ParallelCollectionRDD (e.g. RDDs generated by sc.parallelize)? Sorry if it is a silly question...

Best,
Wenlei



On Mon, Sep 2, 2013 at 12:28 AM, Reynold Xin <rxin@cs.berkeley.edu> wrote:


--
Reynold Xin, AMPLab, UC Berkeley



On Mon, Sep 2, 2013 at 3:24 PM, Wenlei Xie <wenlei.xie@gmail.com> wrote:
Hi,

I am wondering if it is possible to get the locations of the partitions of a cached RDD? I am asking because I am trying to avoid shuffling when performing a coalesce operation. The sizes of my partitions could be quite imbalanced, so CoalescedRDD would probably not be a good solution in my case.

Thank you!

Best,
Wenlei

--
Wenlei Xie (谢文磊)

Department of Computer Science
5132 Upson Hall, Cornell University
Ithaca, NY 14853, USA
Phone: (607) 255-5577



