spark-user mailing list archives

From Jacek Laskowski <ja...@japila.pl>
Subject Re: How can we control CPU and Memory per Spark job operation..
Date Sun, 17 Jul 2016 12:16:39 GMT
Hi,

How would that help?! Why would you do that?

Jacek

On 17 Jul 2016 7:19 a.m., "Pedro Rodriguez" <ski.rodriguez@gmail.com> wrote:

> You could call map on an RDD which has “many” partitions, then call
> repartition/coalesce to drastically reduce the number of partitions so that
> your second map has fewer tasks running.
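>
> A minimal sketch of that idea, assuming a spark-shell session (so sc is
> available) and placeholder functions and paths standing in for the real ones:
>
>   // stand-ins for the actual CPU-heavy and memory-heavy functions
>   val cpuHeavyFn = (s: String) => s.toUpperCase
>   val memoryHeavyFn = (s: String) => s.reverse
>
>   val input = sc.textFile("hdfs:///tmp/input")   // assumed to already have many partitions
>   val cpuStage = input.map(cpuHeavyFn)           // one task per input partition
>   // repartition shuffles, which ends the first stage; the second map then
>   // runs in a new stage with only a few tasks
>   val memStage = cpuStage.repartition(3).map(memoryHeavyFn)
>   memStage.saveAsTextFile("hdfs:///tmp/output")  // hypothetical output path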
>
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
> On July 16, 2016 at 4:46:04 PM, Jacek Laskowski (jacek@japila.pl) wrote:
>
> Hi,
>
> My understanding is that these two map functions will end up as a job
> with one stage (as if you wrote the two maps as a single map), so you
> really need as many vcores and as much memory as possible for both map1
> and map2. I initially thought dynamic allocation of executors might help
> with this case, but since there's just one stage I don't think you can
> do much.
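>
> A tiny illustration, assuming a spark-shell session (sc available) and
> placeholder functions:
>
>   val input = sc.parallelize(1 to 1000000, 80)
>   val cpuHeavyFn = (i: Int) => i * 2            // stand-in for the real work
>   val memoryHeavyFn = (i: Int) => i.toString    // stand-in for the NLP step
>   val out = input.map(cpuHeavyFn).map(memoryHeavyFn)
>   // no shuffle in the lineage => both maps are pipelined into one stage
>   println(out.toDebugString)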
>
> Regards,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 15, 2016 at 9:54 PM, Pavan Achanta <pachanta@sysomos.com> wrote:
> > Hi All,
> >
> > Here is my use case:
> >
> > I have a pipeline job consisting of 2 map functions:
> >
> > A CPU-intensive map operation that does not require a lot of memory.
> > A memory-intensive map operation that requires up to 4 GB of memory; this
> > 4 GB cannot be distributed, since it is an NLP model.
> >
> > Ideally, what I'd like to do is use 20 nodes with 4 cores each and minimal
> > memory for the first map operation, and then use only 3 nodes with minimal
> > CPU but 4 GB of memory each for the 2nd operation.
> >
> > While it is possible to control the parallelism for each map operation in
> > Spark, I am not sure how to control the resources for each operation.
> > Obviously I don't want to start the whole job with 20 nodes that each have
> > 4 cores and 4 GB of memory, since I cannot afford that much memory.
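> >
> > For context, on YARN the executor shape is set once for the whole
> > application (dynamic allocation can change the number of executors, but
> > not their per-executor cores or memory). A minimal sketch of that single
> > shape, with illustrative values and a hypothetical app name:
> >
> >   import org.apache.spark.SparkConf
> >
> >   val conf = new SparkConf()
> >     .setAppName("nlp-pipeline")
> >     .set("spark.executor.instances", "20")
> >     .set("spark.executor.cores", "4")
> >     .set("spark.executor.memory", "4g")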
> >
> > We use YARN with Spark. Any suggestions?
> >
> > Thanks and regards,
> > Pavan
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
