spark-user mailing list archives

From Ankur Dave <ankurd...@gmail.com>
Subject Re: Efficient way to fetch all records on a particular node/partition in GraphX
Date Sun, 17 May 2015 18:45:29 GMT
If you know the partition IDs, you can launch a job that runs tasks on only
those partitions by calling sc.runJob
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686>.
For example, we do this in IndexedRDD
<https://github.com/amplab/spark-indexedrdd/blob/f0c42dcad1f49ce36140f0c1f7d2c3ed61ed373e/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/IndexedRDDLike.scala#L100>
to get particular keys without launching a task on every partition.
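
A minimal sketch of this approach, assuming an already-built `RDD[String]` named `rdd` and a known partition ID (`targetPartition` is a placeholder name, not from the original thread):

```scala
// Run a job only on the chosen partition; tasks are never launched
// on the other partitions, unlike a filter over the whole RDD.
val targetPartition = 5

val results: Array[Array[String]] = sc.runJob(
  rdd,                                      // the RDD to read from
  (iter: Iterator[String]) => iter.toArray, // materialize that partition's records
  Seq(targetPartition)                      // partition IDs to run on
)

// One inner array per requested partition; flatten to get the records.
val records: Array[String] = results.flatten
```

`runJob` returns one result per requested partition, so passing several IDs in the `Seq` fetches multiple partitions in a single job.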

Ankur <http://www.ankurdave.com/>

On Sun, May 17, 2015 at 8:32 AM, mas <mas.hamza@gmail.com> wrote:

> I have distributed my RDD into say 10 nodes. I want to fetch the data that
> resides on a particular node, say "node 5". How can I achieve this?
> I have tried the mapPartitionsWithIndex function to filter the data of that
> corresponding node, however it is pretty expensive.
>
