spark-user mailing list archives

From Josh Asplund <>
Subject Re: Execute function once on each node
Date Tue, 19 Jul 2016 16:30:51 GMT
Technical limitations keep us from running another filesystem on the SSDs.
We are running on a very large HPC cluster without control over low-level
system components. We have tried setting up an ad-hoc HDFS cluster on the
nodes in our allocation, but we have had very little luck. It ends up being
very brittle and difficult for the simulation code to access.
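
One way to approximate "execute once on each node" without a shared filesystem is to launch more tasks than nodes and let the first task to arrive on each host claim a node-local marker file. The sketch below is only an illustration under stated assumptions: `run_once_per_node`, `read_local_everywhere`, `marker_dir`, `job_id`, and `work` are hypothetical names, `sc` is assumed to be an existing SparkContext, and `marker_dir` is assumed to be a directory on each node's local disk. It does not force Spark to place a task on every node, and a task that fails after creating its marker will not be retried on that host.

```python
import os
import socket


def run_once_per_node(marker_dir, job_id, work):
    """Run work() only if this is the first task on this host for job_id.

    Returns (hostname, result) for the winning task, or None for later tasks.
    """
    host = socket.gethostname()
    marker = os.path.join(marker_dir, "once-%s-%s.lock" % (job_id, host))
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # task per marker path gets past this call on a local filesystem.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.close(fd)
    return (host, work())


def read_local_everywhere(sc, num_tasks, marker_dir, job_id, work):
    """Hypothetical driver-side helper; sc is an existing SparkContext."""

    def task(_partition):
        outcome = run_once_per_node(marker_dir, job_id, work)
        if outcome is not None:
            yield outcome

    # Over-partitioning (num_tasks well above the node count) raises the
    # odds that every node receives at least one task; it is no guarantee.
    rdd = sc.parallelize(range(num_tasks), num_tasks)
    return rdd.mapPartitions(task).collect()
```

Comparing the hostnames in the returned list against the allocation's node list would show whether any node was missed.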

On Tue, Jul 19, 2016 at 7:08 AM Koert Kuipers <> wrote:

> The whole point of a well designed global filesystem is to not move the data
> On Jul 19, 2016 10:07, "Koert Kuipers" <> wrote:
>> If you run hdfs on those ssds (with low replication factor) wouldn't it
>> also effectively write to local disk with low latency?
>> On Jul 18, 2016 21:54, "Josh Asplund" <> wrote:
>> The spark workers are running side-by-side with scientific simulation
>> code. The code writes output to local SSDs to keep latency low. Due to the
>> volume of data being moved (tens of terabytes or more), it isn't really feasible
>> to copy the data to a global filesystem. Executing a function on each node
>> would allow us to read the data in situ without a copy.
>> I understand that manually assigning tasks to nodes reduces fault
>> tolerance, but the simulation codes already explicitly assign tasks, so a
>> failure of any one node is already a full-job failure.
>> On Mon, Jul 18, 2016 at 3:43 PM Aniket Bhatnagar <> wrote:
>>> You can't assume that the number of nodes will be constant, as some may
>>> fail; hence you can't guarantee that a function will execute at most once
>>> or at least once on a node. Can you explain your use case in a bit more
>>> detail?
>>> On Mon, Jul 18, 2016, 10:57 PM joshuata <> wrote:
>>>> I am working on a spark application that requires the ability to run a
>>>> function on each node in the cluster. This is used to read data from a
>>>> directory that is not globally accessible to the cluster. I have tried
>>>> creating an RDD with n elements and n partitions so that it is evenly
>>>> distributed among the n nodes, and then mapping a function over the RDD.
>>>> However, the runtime makes no guarantees that each partition will be
>>>> stored on a separate node. This means that the code will run multiple
>>>> times on the same node while never running on another.
>>>> I have looked through the documentation and source code for both RDDs
>>>> and the scheduler, but I haven't found anything that will do what I need.
>>>> Does anybody know of a solution I could use?
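
The placement problem described in the original question can be observed directly by having each partition report the host it ran on. The following is a minimal sketch, assuming `sc` is an existing SparkContext; `tag_with_host`, `tasks_per_host`, and `probe_placement` are hypothetical helper names, not part of Spark.

```python
import socket
from collections import Counter


def tag_with_host(index, iterator):
    """Runs inside a task: emit (hostname, partition index) once."""
    yield (socket.gethostname(), index)


def tasks_per_host(pairs):
    """Count how many partitions each host processed."""
    return Counter(host for host, _ in pairs)


def probe_placement(sc, n):
    """sc is an existing SparkContext; n is the number of partitions."""
    # One element per partition, as in the approach described in the question.
    rdd = sc.parallelize(range(n), n)
    return tasks_per_host(rdd.mapPartitionsWithIndex(tag_with_host).collect())
```

A host with a count above one, or a node missing from the result, would confirm that the n partitions were not spread one per node.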
