spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Siegmann <daniel.siegm...@velos.io>
Subject Re: Number of partitions and Number of concurrent tasks
Date Fri, 01 Aug 2014 20:21:46 GMT
It is definitely possible to run multiple workers on a single node and have
each worker with the maximum number of cores (e.g. if you have 8 cores and
2 workers you'd have 16 cores per node). I don't know if it's possible with
the out of the box scripts though.

It's actually not really that difficult. You just run start-slave.sh
multiple times on the same node, with different IDs. Here is the usage:

# Usage: start-slave.sh <worker#> <master-spark-URL>

But we have custom scripts to do that. I'm not sure whether it is possible
using the standard start-all.sh script or that EC2 script. Probably not.

I haven't set up or managed such a cluster myself, so that's about the
extent of my knowledge. But I've deployed jobs to that cluster and enjoyed
the benefit of double the cores - we had a fair amount of I/O though, which
may be why it helped in our case. I recommend taking a look at the CPU
utilization on the nodes when running a flow before jumping through these
hoops.


On Fri, Aug 1, 2014 at 12:05 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> Darin,
>
> I think the number of cores in your cluster is a hard limit on how many
> concurrent tasks you can execute at one time. If you want more parallelism,
> I think you just need more cores in your cluster--that is, bigger nodes, or
> more nodes.
>
> Daniel,
>
> Have you been able to get around this limit?
>
> Nick
>
>
>
> On Fri, Aug 1, 2014 at 11:49 AM, Daniel Siegmann <daniel.siegmann@velos.io
> > wrote:
>
>> Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem
>> could be. Hopefully someone else will be able to help. The only thing I
>> could suggest is to try setting both the worker instances and the number of
>> cores (assuming spark-ec2 has such a parameter).
>>
>>
>> On Thu, Jul 31, 2014 at 3:03 PM, Darin McBeath <ddmcbeath@yahoo.com>
>> wrote:
>>
>>> Ok, I set the number of spark worker instances to 2 (below is my startup
>>> command).  But, this essentially had the effect of increasing my number of
>>> workers from 3 to 6 (which was good) but it also reduced my number of cores
>>> per worker from 8 to 4 (which was not so good).  In the end, I would still
>>> only be able to concurrently process 24 partitions in parallel.  I'm
>>> starting a stand-alone cluster using the spark provided ec2 scripts .  I
>>> tried setting the env variable for SPARK_WORKER_CORES in the spark_ec2.py
>>> but this had no effect. So, it's not clear if I could even set the
>>> SPARK_WORKER_CORES with the ec2 scripts.  Anyway, not sure if there is
>>> anything else I can try but at least wanted to document what I did try and
>>> the net effect.  I'm open to any suggestions/advice.
>>>
>>>  ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3
>>> -t m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2
>>> *my-cluster*
>>>
>>>
>>>   ------------------------------
>>>  *From:* Daniel Siegmann <daniel.siegmann@velos.io>
>>> *To:* Darin McBeath <ddmcbeath@yahoo.com>
>>> *Cc:* Daniel Siegmann <daniel.siegmann@velos.io>; "user@spark.apache.org"
>>> <user@spark.apache.org>
>>> *Sent:* Thursday, July 31, 2014 10:04 AM
>>>
>>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>>
>>> I haven't configured this myself. I'd start with setting
>>> SPARK_WORKER_CORES to a higher value, since that's a bit simpler than
>>> adding more workers. This defaults to "all available cores" according to
>>> the documentation, so I'm not sure if you can actually set it higher. If
>>> not, you can get around this by adding more worker instances; I believe
>>> simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.
>>>
>>> I don't think you *have* to set the cores if you have more workers - it
>>> will default to 8 cores per worker (in your case). But maybe 16 cores per
>>> node will be too many. You'll have to test. Keep in mind that more workers
>>> means more memory and such too, so you may need to tweak some other
>>> settings downward in this case.
>>>
>>> On a side note: I've read some people found performance was better when
>>> they had more workers with less memory each, instead of a single worker
>>> with tons of memory, because it cut down on garbage collection time. But I
>>> can't speak to that myself.
>>>
>>> In any case, if you increase the number of cores available in your
>>> cluster (whether per worker, or adding more workers per node, or of course
>>> adding more nodes) you should see more tasks running concurrently. Whether
>>> this will actually be *faster* probably depends mainly on whether the
>>> CPUs in your nodes were really being fully utilized with the current number
>>> of cores.
>>>
>>>
>>> On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <ddmcbeath@yahoo.com>
>>> wrote:
>>>
>>> Thanks.
>>>
>>>  So to make sure I understand.  Since I'm using a 'stand-alone'
>>> cluster, I would set SPARK_WORKER_INSTANCES to something like 2
>>> (instead of the default value of 1).  Is that correct?  But, it also sounds
>>> like I need to explicitly set a value for SPARKER_WORKER_CORES (based on
>>> what the documentation states).  What would I want that value to be based
>>> on my configuration below?  Or, would I leave that alone?
>>>
>>>   ------------------------------
>>>  *From:* Daniel Siegmann <daniel.siegmann@velos.io>
>>> *To:* user@spark.apache.org; Darin McBeath <ddmcbeath@yahoo.com>
>>> *Sent:* Wednesday, July 30, 2014 5:58 PM
>>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>>
>>> This is correct behavior. Each "core" can execute exactly one task at a
>>> time, with each task corresponding to a partition. If your cluster only has
>>> 24 cores, you can only run at most 24 tasks at once.
>>>
>>> You could run multiple workers per node to get more executors. That
>>> would give you more cores in the cluster. But however many cores you have,
>>> each core will run only one task at a time.
>>>
>>>
>>> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <ddmcbeath@yahoo.com>
>>> wrote:
>>>
>>>  I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>>>
>>> I have an RDD<String> which I've repartitioned so it has 100 partitions
>>> (hoping to increase the parallelism).
>>>
>>> When I do a transformation (such as filter) on this RDD, I can't  seem
>>> to get more than 24 tasks (my total number of cores across the 3 nodes)
>>> going at one point in time.  By tasks, I mean the number of tasks that
>>> appear under the Application UI.  I tried explicitly setting the
>>> spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently
>>> running) and verified this in the Application UI for the running
>>> application but this had no effect.  Perhaps, this is ignored for a
>>> 'filter' and the default is the total number of cores available.
>>>
>>> I'm fairly new with Spark so maybe I'm just missing or misunderstanding
>>> something fundamental.  Any help would be appreciated.
>>>
>>> Thanks.
>>>
>>> Darin.
>>>
>>>
>>>
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegmann@velos.io W: www.velos.io
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegmann@velos.io W: www.velos.io
>>>
>>>
>>>
>>
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>> E: daniel.siegmann@velos.io W: www.velos.io
>>
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io

Mime
View raw message