spark-user mailing list archives

From Koert Kuipers <>
Subject Re: Execute function once on each node
Date Tue, 19 Jul 2016 14:08:47 GMT
The whole point of a well-designed global filesystem is to not move the data.

On Jul 19, 2016 10:07, "Koert Kuipers" <> wrote:

> If you run HDFS on those SSDs (with a low replication factor), wouldn't it
> also effectively write to local disk with low latency?
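Koert's suggestion above — HDFS on the local SSDs with minimal replication — would in a stock HDFS deployment come down to a setting like the following in hdfs-site.xml (a sketch; `dfs.replication` is a standard HDFS property, but the right value depends on how much redundancy the job can afford to lose):

```xml
<!-- hdfs-site.xml: keep a single replica so each write stays on the
     local SSD of the node that produced it -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```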
> On Jul 18, 2016 21:54, "Josh Asplund" <> wrote:
> The spark workers are running side-by-side with scientific simulation
> code. The code writes output to local SSDs to keep latency low. Due to the
> volume of data being moved (tens of terabytes or more), it isn't really feasible
> to copy the data to a global filesystem. Executing a function on each node
> would allow us to read the data in situ without a copy.
> I understand that manually assigning tasks to nodes reduces fault
> tolerance, but the simulation codes already explicitly assign tasks, so a
> failure of any one node is already a full-job failure.
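One way to approximate "execute once per node" from ordinary tasks (a sketch of my own, not a Spark guarantee: `run_once_per_node` is a hypothetical helper you would call from every task, e.g. inside `mapPartitions`) is to have each task race to create a node-local lock file atomically; only the winner runs the function, so even if several partitions land on the same node, the in-situ read happens once there:

```python
import os
import tempfile

def run_once_per_node(fn, lock_dir=None):
    """Run fn at most once per node (hypothetical helper, not a Spark API).

    Every task calls this; only the first task on a given node wins the
    O_CREAT | O_EXCL race and actually executes fn."""
    lock_path = os.path.join(lock_dir or tempfile.gettempdir(),
                             "once-per-node.lock")
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        return False  # another task on this node already ran fn
    fn()
    return True

# Simulate four tasks landing on the same node:
calls = []
with tempfile.TemporaryDirectory() as d:
    results = [run_once_per_node(lambda: calls.append(1), lock_dir=d)
               for _ in range(4)]

print(results, len(calls))  # → [True, False, False, False] 1
```

Note the lock file must live on node-local storage (not the global filesystem) and be cleaned up between applications, or the function will never run again on that node.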
> On Mon, Jul 18, 2016 at 3:43 PM Aniket Bhatnagar <
>> wrote:
>> You can't assume that the number of nodes will be constant, as some may
>> fail, hence you can't guarantee that a function will execute at most once
>> or at least once on a node. Can you explain your use case in a bit more
>> detail?
>> On Mon, Jul 18, 2016, 10:57 PM joshuata <> wrote:
>>> I am working on a spark application that requires the ability to run a
>>> function on each node in the cluster. This is used to read data from a
>>> directory that is not globally accessible to the cluster. I have tried
>>> creating an RDD with n elements and n partitions so that it is evenly
>>> distributed among the n nodes, and then mapping a function over the RDD.
>>> However, the runtime makes no guarantees that each partition will be
>>> stored on a separate node. This means that the code will run multiple
>>> times on the same node while never running on another.
>>> I have looked through the documentation and source code for both RDDs and
>>> the scheduler, but I haven't found anything that will do what I need.
>>> Does anybody know of a solution I could use?
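The placement problem described above can be seen without a cluster (a hypothetical illustration; random choice stands in for a scheduler that offers no locality guarantee): if n partitions are each placed on an arbitrary node independently, most placements leave at least one node with no partition at all.

```python
import random

random.seed(0)
nodes = [f"node{i}" for i in range(4)]

# Stand-in for a scheduler with no locality guarantee: each of the
# n partitions is placed on an arbitrary node, independently.
def place_partitions():
    return [random.choice(nodes) for _ in range(len(nodes))]

trials = [place_partitions() for _ in range(100)]
missed = sum(1 for t in trials if len(set(t)) < len(nodes))
print(f"{missed}/100 placements left at least one node with no partition")
```

With 4 nodes, only 4!/4^4 ≈ 9% of independent placements happen to cover every node, which is why "n elements, n partitions" does not behave like "one task per node".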
