spark-user mailing list archives

From Pedro Rodriguez <ski.rodrig...@gmail.com>
Subject Re: How can we control CPU and Memory per Spark job operation..
Date Sun, 17 Jul 2016 05:18:11 GMT
You could call map on an RDD that has “many” partitions, then call repartition/coalesce
to drastically reduce the number of partitions so that your second map has far fewer
tasks running at once.
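
For what it's worth, a minimal sketch of that pattern (the input/output paths,
the partition counts, and the two placeholder functions are just illustrative
assumptions, not your actual pipeline):

  import org.apache.spark.sql.SparkSession

  object TwoPhaseMap {
    def main(args: Array[String]): Unit = {
      val sc = SparkSession.builder.appName("two-phase-map").getOrCreate().sparkContext

      // Placeholders for the two steps (pure assumptions).
      def cpuHeavy(line: String): String = line.toUpperCase // CPU-bound parse
      def nlpHeavy(line: String): String = line.reverse     // 4 GB-model step

      // Phase 1: many partitions so the CPU-bound map runs with wide parallelism.
      val wide = sc.textFile("hdfs:///input", 80).map(cpuHeavy)

      // repartition(3) forces a shuffle, i.e. a stage boundary: the first map
      // still runs as 80 tasks, while the second runs as only 3, so at most 3
      // copies of the memory-hungry step are active at once. coalesce(3) would
      // skip the shuffle but would pull the first map into the same 3-task stage.
      val narrow = wide.repartition(3).map(nlpHeavy)

      narrow.saveAsTextFile("hdfs:///output")
    }
  }

Note this only limits how many tasks run concurrently in the second stage; the
executors themselves still keep whatever memory and cores they were launched with.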

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 16, 2016 at 4:46:04 PM, Jacek Laskowski (jacek@japila.pl) wrote:

Hi,  

My understanding is that these two map functions will end up in a job with
a single stage (as if you had written the two maps as one map), so you
really need as many vcores and as much memory as possible for map1 and map2
at the same time. I initially thought dynamic allocation of executors might
help here, but since there's just one stage I don't think you can do much.
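
To illustrate, here is a quick sketch (assuming an existing SparkContext sc;
the numbers are arbitrary):

  val rdd = sc.parallelize(1 to 100, 20)
  val out = rdd.map(_ * 2).map(_ + 1)
  // Both maps are narrow transformations, so Spark pipelines them into a
  // single stage: each of the 20 tasks runs the first map and then the
  // second on its partition, with no shuffle boundary in between.
  println(out.toDebugString)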

Pozdrawiam,  
Jacek Laskowski  
----  
https://medium.com/@jaceklaskowski/  
Mastering Apache Spark http://bit.ly/mastering-apache-spark  
Follow me at https://twitter.com/jaceklaskowski  


On Fri, Jul 15, 2016 at 9:54 PM, Pavan Achanta <pachanta@sysomos.com> wrote:  
> Hi All,  
>  
> Here is my use case:  
>  
> I have a pipeline job consisting of 2 map functions:  
>  
> A CPU-intensive map operation that does not require much memory.
> A memory-intensive map operation that requires up to 4 GB of memory; this
> 4 GB cannot be distributed, since it is an NLP model.
>  
> Ideally, what I'd like to do is use 20 nodes with 4 cores each and minimal
> memory for the first map operation, and then only 3 nodes with minimal CPU
> but 4 GB of memory each for the second operation.
>  
> While it is possible to control the parallelism for each map operation in
> Spark, I am not sure how to control the resources for each operation.
> Obviously I don't want to start the job with 20 nodes each having 4 cores
> and 4 GB of memory, since I cannot afford that much memory.
>  
> We use YARN with Spark. Any suggestions?
>  
> Thanks and regards,  
> Pavan  
>  
>  

---------------------------------------------------------------------  
To unsubscribe e-mail: user-unsubscribe@spark.apache.org  

