spark-user mailing list archives

From: Pavan Achanta
Subject: How can we control CPU and Memory per Spark job operation..
Date: Fri, 15 Jul 2016 19:54:44 GMT

Hi All,

Here is my use case:

I have a pipeline job consisting of 2 map functions:

  1.  A CPU-intensive map operation that does not require a lot of memory.
  2.  A memory-intensive map operation that requires up to 4 GB of memory. This 4 GB cannot
be split across nodes, since it is a single NLP model that must be loaded whole.

Ideally, what I would like to do is use 20 nodes with 4 cores each and minimal memory for the
first map operation, and then use only 3 nodes with minimal CPU but 4 GB of memory each for
the second operation.

While it is possible to control the parallelism of each map operation in Spark, I am not
sure how to control the resources for each operation. Obviously, I don't want to start the
job with 20 nodes that each have 4 cores and 4 GB of memory, since I cannot afford that much memory.
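
To put rough numbers on the difference, here is a back-of-the-envelope sketch in Python. The 1 GB figure for "minimal memory" is only an assumption for illustration (I have not measured it), and the helper name is made up for this example.

```python
# Rough memory totals for the two layouts described above.
# ASSUMPTION: "minimal memory" is taken as 1 GB per node, purely for illustration.
MINIMAL_GB = 1  # hypothetical value, not a measurement
MODEL_GB = 4    # per-node memory needed to hold the NLP model


def total_memory_gb(nodes: int, gb_per_node: int) -> int:
    """Total memory requested across all nodes of one layout."""
    return nodes * gb_per_node


# Layout to avoid: all 20 nodes sized for the memory-heavy step.
uniform = total_memory_gb(20, MODEL_GB)

# Desired layout: each operation sized separately.
stage1 = total_memory_gb(20, MINIMAL_GB)  # CPU-intensive map
stage2 = total_memory_gb(3, MODEL_GB)     # memory-intensive map
staged_peak = max(stage1, stage2)

print(f"uniform sizing: {uniform} GB")
print(f"per-operation sizing, peak: {staged_peak} GB")
```

Under that assumption, sizing every node for the model asks for 80 GB in total, while sizing each operation separately peaks at 20 GB.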

We run Spark on YARN. Any suggestions?

Thanks and regards,
