spark-user mailing list archives

From joshuata <joshaspl...@gmail.com>
Subject Execute function once on each node
Date Mon, 18 Jul 2016 21:57:42 GMT
I am working on a Spark application that needs to run a function once on
each node in the cluster. The function reads data from a directory that is
not globally accessible to the cluster. I have tried creating an RDD with n
elements and n partitions, so that it is evenly distributed among the n
nodes, and then mapping a function over the RDD. However, Spark makes no
guarantee that each partition will be placed on a separate node, so the
function can run several times on one node and never run on another.
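
For illustration, the attempt described above might look roughly like the
PySpark sketch below. This is not the code from my application: the names
`report_host`, `summarize`, and `run_on_cluster` are hypothetical, `sc` is
assumed to be an existing SparkContext, and returning the hostname stands in
for the real per-node work so that the placement of tasks becomes visible.

```python
import socket
from collections import Counter


def report_host(_):
    """Runs inside a task; returns the hostname of the machine it ran on."""
    return socket.gethostname()


def summarize(hosts, expected_nodes):
    """Count runs per host and how many of the expected nodes were never used."""
    counts = Counter(hosts)
    return {
        "runs_per_host": dict(counts),
        "nodes_missed": expected_nodes - len(counts),
    }


def run_on_cluster(sc, n):
    """n elements in n partitions, one task per partition (assumes a live SparkContext)."""
    rdd = sc.parallelize(range(n), n)  # second argument is the number of partitions
    hosts = rdd.map(report_host).collect()
    return summarize(hosts, n)
```

Calling `run_on_cluster(sc, n)` on an n-node cluster would be expected to show
some hosts with more than one run and a nonzero `nodes_missed` whenever the
scheduler places two partitions on the same node.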

I have looked through the documentation and source code for both RDDs and
the scheduler, but I haven't found anything that will do what I need. Does
anybody know of a solution I could use?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Execute-function-once-on-each-node-tp27351.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Mime
View raw message