spark-user mailing list archives

From Stephen Boesch <>
Subject Re: HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore
Date Sun, 20 Nov 2016 00:52:05 GMT
Although your proposed N partitions may appear to saturate the N available
workers, the actual assignment of tasks to workers is controlled by the
scheduler. In my experience, you can *not* trust the default Fair Scheduler
to schedule the tasks round-robin: you may well end up with some tasks
being queued.

My suggestion is to try it out on the resource manager and scheduler used
in your deployment. You may need to swap out the default scheduler for a
true round-robin one.
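
To see why queued tasks matter for a socket-style sync, here is a minimal
sketch in plain Python. It is not Spark code: the thread pool stands in for
the executor slots, `threading.Barrier` stands in for the sync between
partitions, and `run_gang`, `num_tasks`, `num_slots` and `timeout` are
hypothetical names chosen for illustration. If fewer slots than tasks are
available, the tasks that did get a slot wait for peers that are still
queued, and the sync times out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_gang(num_tasks, num_slots, timeout=1.0):
    """Return True if every task reached the sync point together.

    num_tasks: number of partitions that must synchronise (hypothetical).
    num_slots: number of tasks that can run at once (stand-in for vcores).
    """
    # All num_tasks tasks must arrive before any of them may proceed.
    barrier = threading.Barrier(num_tasks)

    def task(_):
        try:
            barrier.wait(timeout=timeout)
            return True
        except threading.BrokenBarrierError:
            # A peer never arrived in time (for example, it was still queued).
            return False

    # The pool runs at most num_slots tasks concurrently; the rest queue.
    with ThreadPoolExecutor(max_workers=num_slots) as pool:
        results = list(pool.map(task, range(num_tasks)))
    return all(results)


if __name__ == "__main__":
    print(run_gang(4, 4))               # every task holds a slot
    print(run_gang(4, 2, timeout=0.2))  # two tasks queue behind the others
```

The same reasoning applies to any all-or-nothing sync: it only succeeds
if the scheduler actually runs all N tasks at once, which is why it is worth
testing on the real deployment.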

2016-11-19 16:44 GMT-08:00 Adam Smith <>:

> Dear community,
> I have an RDD with N rows and N partitions. I want to ensure that the
> partitions all run at the same time, by setting the number of vcores
> (spark-yarn) to N. The partitions need to talk to each other with some
> socket-based sync, which is why I need them to run more or less
> simultaneously.
> Let's assume no node will die. Will my setup guarantee that all partitions
> are computed in parallel?
> I know this is somewhat hackish. Is there a better way of doing so?
> My goal is to replicate message passing (like OpenMPI) with Spark, where I
> have very specific and final communication requirements. So no need for the
> many comm and sync functionality, just what I already have - sync and talk.
> Thanks!
> Adam
