spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: balancing RDDs
Date Tue, 24 Jun 2014 22:41:11 GMT
This would be really useful. Especially for Shark where shift of
partitioning effects all subsequent queries unless task scheduling time
beats spark.locality.wait. Can cause overall low performance for all
subsequent tasks.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Tue, Jun 24, 2014 at 4:10 AM, Sean McNamara <Sean.McNamara@webtrends.com>
wrote:

> We have a use case where we’d like something to execute once on each node
> and I thought it would be good to ask here.
>
> Currently we achieve this by setting the parallelism to the number of
> nodes and use a mod partitioner:
>
> val balancedRdd = sc.parallelize(
>         (0 until Settings.parallelism)
>         .map(id => id -> Settings.settings)
>       ).partitionBy(new ModPartitioner(Settings.parallelism))
>       .cache()
>
>
> This works great except in two instances where it can become unbalanced:
>
> 1. if a worker is restarted or dies, the partition will move to a
> different node (one of the nodes will run two tasks).  When the worker
> rejoins, is there a way to have a partition move back over to the newly
> restarted worker so that it’s balanced again?
>
> 2. drivers need to be started in a staggered fashion, otherwise one driver
> can launch two tasks on one set of workers, and the other driver will do
> the same with the other set.  Are there any scheduler/config semantics so
> that each driver will take one (and only one) core from *each* node?
>
>
> Thanks
>
> Sean
>
>
>
>
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message