spark-user mailing list archives

From Josh Asplund <joshaspl...@gmail.com>
Subject Re: Execute function once on each node
Date Tue, 19 Jul 2016 01:54:12 GMT
The Spark workers are running side-by-side with scientific simulation code.
The code writes output to local SSDs to keep latency low. Due to the volume
of data being moved (tens of terabytes or more), it isn't really feasible to
copy the data to a global filesystem. Executing a function on each node would
allow us to read the data in situ without a copy.

I understand that manually assigning tasks to nodes reduces fault
tolerance, but the simulation codes already explicitly assign tasks, so a
failure of any one node is already a full-job failure.
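
A minimal sketch of how the placement problem from the original question
could be observed, assuming PySpark and an existing SparkContext passed in
as `sc`. The helper names `probe_cluster` and `summarize_placement` and the
host names in the example are hypothetical. It builds an RDD with n elements
and n partitions, records the hostname each task ran on, and reports which
expected nodes ran no task. It only diagnoses uneven placement; it does not
force one task per node.

```python
import socket
from collections import Counter


def summarize_placement(hostnames, expected_nodes):
    """Count tasks per host and list expected nodes that ran no task."""
    counts = Counter(hostnames)
    missing = sorted(set(expected_nodes) - set(counts))
    return dict(counts), missing


def probe_cluster(sc, n):
    """Run one task per partition and return the host each task ran on.

    Assumes `sc` is a live SparkContext and `n` is the number of nodes.
    Not called here, so the sketch loads without a Spark installation.
    """
    return (
        sc.parallelize(range(n), n)            # n elements, n partitions
          .map(lambda _: socket.gethostname()) # evaluated on the executor
          .collect()
    )


# Hypothetical result for a 3-node cluster in which the scheduler placed
# two partitions on node-a and none on node-c.
counts, missing = summarize_placement(
    ["node-a", "node-a", "node-b"],
    ["node-a", "node-b", "node-c"],
)
print(counts, missing)
```

On a real cluster, the list returned by `probe_cluster(sc, n)` would be
passed to `summarize_placement` together with the known node names. A
non-empty `missing` list confirms that some node never ran the function.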

On Mon, Jul 18, 2016 at 3:43 PM Aniket Bhatnagar <aniket.bhatnagar@gmail.com>
wrote:

> You can't assume that the number of nodes will be constant, as some may
> fail; hence you can't guarantee that a function will execute at most once
> or at least once on a node. Can you explain your use case in a bit more
> detail?
>
> On Mon, Jul 18, 2016, 10:57 PM joshuata <joshasplund@gmail.com> wrote:
>
>> I am working on a Spark application that requires the ability to run a
>> function on each node in the cluster. This is used to read data from a
>> directory that is not globally accessible to the cluster. I have tried
>> creating an RDD with n elements and n partitions so that it is evenly
>> distributed among the n nodes, and then mapping a function over the RDD.
>> However, the runtime makes no guarantee that each partition will be
>> stored on a separate node. This means that the code can run multiple
>> times on the same node while never running on another.
>>
>> I have looked through the documentation and source code for both RDDs and
>> the scheduler, but I haven't found anything that will do what I need. Does
>> anybody know of a solution I could use?
