spark-user mailing list archives

From Reynold Xin <r...@cs.berkeley.edu>
Subject Re: Getting the partition position of cached RDD?
Date Mon, 02 Sep 2013 10:52:25 GMT
In that pull request, DAGScheduler's getPreferredLocs can now be called by
an external program (although it is subject to a package-level visibility
constraint).

A parallel collection RDD doesn't allow you to specify locality constraints,
but you can easily implement a new RDD that wraps it and lets you specify
those constraints.

Some semi-pseudocode:


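// 0.8+ package names; 0.7.x used the `spark` package instead.
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// locs(i) is the preferred host for partition i of prev.
class LocalityConstraintRDD[T: ClassManifest](prev: RDD[T], locs: Array[String])
   extends RDD[T](prev) {

   // Same partitions as the parent; only the locality preference changes.
   override def getPartitions: Array[Partition] = prev.partitions

   override def compute(split: Partition, context: TaskContext): Iterator[T] =
     prev.iterator(split, context)

   override def getPreferredLocations(split: Partition): Seq[String] =
     List(locs(split.index))
}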
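
For the size-balanced coalescing described in the quoted message below, a
minimal sketch of the grouping step might look like the following. It is plain
Scala and does not touch Spark; SizeBalancedGrouping, Part, group and
targetSize are hypothetical names used only for illustration, and it assumes
the host and size of every cached parent partition are already known. Within
each host it assigns partitions, largest first, to whichever group is
currently lightest, so no group mixes hosts.

```scala
// Hypothetical helper, not part of Spark.
object SizeBalancedGrouping {
  // One cached parent partition: its index, the host caching it, its size.
  case class Part(index: Int, host: String, size: Long)

  // Returns (host, parent partition indices) for each coalesced group.
  def group(parts: Seq[Part], targetSize: Long): Seq[(String, Seq[Int])] = {
    require(targetSize > 0, "targetSize must be positive")
    parts.groupBy(_.host).toSeq.sortBy(_._1).flatMap { case (host, ps) =>
      // Number of groups on this host: total size / targetSize, rounded up.
      val total = ps.map(_.size).sum
      val k = math.max(1L, (total + targetSize - 1) / targetSize).toInt
      val members = Array.fill(k)(Vector.empty[Int])
      val totals = Array.fill(k)(0L)
      // Largest partitions first, each into the lightest group so far.
      for (p <- ps.sortBy(-_.size)) {
        val g = totals.indexOf(totals.min)
        members(g) = members(g) :+ p.index
        totals(g) += p.size
      }
      members.toSeq.filter(_.nonEmpty).map(g => (host, g))
    }
  }
}
```

Each resulting group would become one coalesced partition, and its host is the
value to return from getPreferredLocations for that partition.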






--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Mon, Sep 2, 2013 at 4:10 PM, Wenlei Xie <wenlei.xie@gmail.com> wrote:

> Thank you! It's a very nice improvement :).
>
> However, my situation is a bit different -- the code there tries to give
> each coalesced partition roughly the same *number of parent partitions*,
> while in my case the parent partitions could be quite imbalanced, and I am
> trying to give each coalesced partition roughly the same *SIZE*.
>
> Of course, this requires the sizes of the parent partitions to be known --
> which is not a problem in my case, as I always generate and cache them.
> This is probably not a common case, so I am happy to write my own hacky
> code to work around it -- but I need the location of each cached
> partition...
>
> By the way: is it possible to assign preferred locations to a
> ParallelCollectionRDD (e.g. an RDD generated by sc.parallelize)? Sorry if
> it is a silly question...
>
> Best,
> Wenlei
>
>
>
> On Mon, Sep 2, 2013 at 12:28 AM, Reynold Xin <rxin@cs.berkeley.edu> wrote:
>
>> Does this help you? https://github.com/mesos/spark/pull/832
>>
>>
>> --
>> Reynold Xin, AMPLab, UC Berkeley
>> http://rxin.org
>>
>>
>>
>> On Mon, Sep 2, 2013 at 3:24 PM, Wenlei Xie <wenlei.xie@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am wondering if it is possible to get the partition position of a cached
>>> RDD. I am asking because I am trying to avoid shuffling when performing a
>>> coalesce operation, and the sizes of my partitions could be quite
>>> imbalanced, so CoalescedRDD would probably not be a good solution in my
>>> case.
>>>
>>> Thank you!
>>>
>>> Best,
>>> Wenlei
>>>
>>> --
>>> Wenlei Xie (谢文磊)
>>>
>>> Department of Computer Science
>>> 5132 Upson Hall, Cornell University
>>> Ithaca, NY 14853, USA
>>> Phone: (607) 255-5577
>>> Email: wenlei.xie@gmail.com
>>>
>>
>>
>
>
