spark-user mailing list archives

From joshuata <>
Subject Execute function once on each node
Date Mon, 18 Jul 2016 21:57:42 GMT
I am working on a spark application that requires the ability to run a
function on each node in the cluster. This is used to read data from a
directory that is not globally accessible to the cluster. I have tried
creating an RDD with n elements and n partitions so that it is evenly
distributed among the n nodes, and then mapping a function over the RDD.
However, the runtime makes no guarantee that each partition will be placed
on a separate node. As a result, the function can run multiple times on one
node while never running on others.
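The attempt described above looks roughly like the following sketch (names
such as runOnNode and the use of getExecutorMemoryStatus to estimate the node
count are illustrative assumptions, not code from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OncePerNodeAttempt {
  // Hypothetical per-node function, e.g. reading a node-local directory.
  def runOnNode(i: Int): String =
    java.net.InetAddress.getLocalHost.getHostName

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("once-per-node"))

    // Assumption: one executor per node, so the executor count
    // approximates the node count.
    val numNodes = sc.getExecutorMemoryStatus.size

    // n elements in n partitions. Note: the scheduler may still place
    // several of these partitions on the same node, which is exactly
    // the problem described above.
    val hosts = sc.parallelize(0 until numNodes, numNodes)
      .map(runOnNode)
      .collect()

    println(hosts.mkString(", "))
    sc.stop()
  }
}
```

Printing the hostnames returned by each task makes the failure visible: when
two partitions land on the same node, that hostname appears twice in the
output while another node's name is missing.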

I have looked through the documentation and source code for both RDDs and
the scheduler, but I haven't found anything that will do what I need. Does
anybody know of a solution I could use?
