spark-user mailing list archives

From Pedro Rodriguez <>
Subject Re: How can we control CPU and Memory per Spark job operation..
Date Fri, 22 Jul 2016 13:20:00 GMT
Sorry, I wasn’t very clear (it looks like Pavan’s response was dropped from the list for some reason as well).

I am assuming that:
1) the first map is CPU bound
2) the second map is heavily memory bound

To be specific, let’s say you are using 4 m3.2xlarge instances, which have 8 CPUs and 30GB of RAM each, for a total of 32 cores and 120GB of RAM. Since the NLP model can’t be distributed, every worker/core must hold its own 4GB copy. If the cluster is fully utilized, the NLP model alone consumes 32 * 4GB = 128GB of RAM, so the cluster is out of memory before even considering the data set itself. My suggestion would be to see if r3.8xlarge instances will work (or even X1s if you have access), since their CPU/memory ratio is better. Here is the “hack” I proposed in more detail (basically
n partitions < total cores):

1) Have the first map run with a regular number of partitions; 32 * 4 = 128 is a reasonable starting place.
2) Repartition to 16 partitions immediately after that map. Spark is not guaranteed to distribute your work evenly across the 4 nodes at this point, but it probably will. The net result is that half the CPU cores sit idle, but the NLP model is at worst using 16 * 4GB = 64GB of RAM. To be sure, this is a hack, since an even distribution of work across the nodes is not guaranteed. A sketch follows below.
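A minimal sketch of the hack in Scala, assuming sc is your SparkContext; cpuHeavyParse, nlpModelScore, and the paths are made-up stand-ins for your actual pipeline:

import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the real pipeline steps.
def cpuHeavyParse(line: String): String = line                 // CPU-bound, little memory
def nlpModelScore(doc: String): (String, Double) = (doc, 0.0)  // loads the ~4GB NLP model

val parsed: RDD[String] = sc.textFile("hdfs:///input")  // hypothetical input path
  .repartition(128)   // 32 cores * 4 = 128 tasks keeps every core busy
  .map(cpuHeavyParse)

val scored = parsed
  .repartition(16)    // at most 16 tasks run concurrently from here on
  .map(nlpModelScore) // peak model memory: 16 * 4GB = 64GB instead of 128GB

Note that coalesce(16) would avoid the shuffle, but without a shuffle the two maps stay in one stage and the first map’s parallelism collapses to 16 as well; repartition forces a stage boundary, so the first map still runs with 128 tasks.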

If you wanted to do this without the hack, you could perform the first map, checkpoint your work, end the job, then submit a new job with a more favorable CPU/memory ratio that reads the prior job’s output. Whether that is worthwhile probably depends heavily on how expensive reloading the data set from disk/network is.
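A rough sketch of the two-job version, reusing the hypothetical names above (the intermediate path and output format are placeholders):

// Job 1: submitted with a CPU-heavy configuration.
sc.textFile("hdfs:///input")
  .map(cpuHeavyParse)
  .saveAsObjectFile("hdfs:///intermediate")  // persist the result between jobs

// Job 2: submitted separately with a memory-heavy configuration,
// e.g. spark-submit --executor-cores 1 --executor-memory 5g on YARN.
sc.objectFile[String]("hdfs:///intermediate")
  .repartition(16)
  .map(nlpModelScore)
  .saveAsObjectFile("hdfs:///output")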

Hopefully one of these helps,
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni | 909-353-4423 | LinkedIn

On July 17, 2016 at 6:16:41 AM, Jacek Laskowski wrote:


How would that help?! Why would you do that?


On 17 Jul 2016 7:19 a.m., "Pedro Rodriguez" <> wrote:
You could call map on an RDD which has “many” partitions, then call repartition/coalesce
to drastically reduce the number of partitions so that your second map runs with fewer concurrent tasks.

Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni | 909-353-4423 | LinkedIn

On July 16, 2016 at 4:46:04 PM, Jacek Laskowski wrote:


My understanding is that these two map functions will end up as a job
with one stage (as if you wrote the two maps as a single map), so you
really need as many vcores and as much memory as possible for map1 and
map2. I initially thought dynamic allocation of executors might help
with this case, but since there's just one stage I don't think you can
do much.
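To illustrate with a toy example (rdd, f, and g standing in for the real data and functions):

// Chained maps are narrow transformations, so Spark pipelines them into a
// single stage: each task runs f and then g over one partition.
val oneStage = rdd.map(f).map(g)

// A repartition inserts a shuffle, splitting the job into two stages whose
// parallelism can differ, which is what the repartition trick relies on.
val twoStages = rdd.map(f).repartition(16).map(g)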

Jacek Laskowski
Mastering Apache Spark
Follow me at

On Fri, Jul 15, 2016 at 9:54 PM, Pavan Achanta <> wrote:
> Hi All,
> Here is my use case:
> I have a pipeline job consisting of 2 map functions:
> 1) A CPU-intensive map operation that does not require a lot of memory.
> 2) A memory-intensive map operation that requires up to 4GB of memory. This
> 4GB cannot be distributed, since it is an NLP model.
> Ideally, I would like to use 20 nodes with 4 cores each and minimal memory
> for the first map operation, and then only 3 nodes with minimal CPU but 4GB
> of memory each for the second operation.
> While it is possible to control the parallelism of each map operation in
> Spark, I am not sure how to control the resources for each operation.
> Obviously I don't want to start the job with 20 nodes that each have 4 cores
> and 4GB of memory, since I cannot afford that much memory.
> We use YARN with Spark. Any suggestions?
> Thanks and regards,
> Pavan

