spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: Limiting number of cores per job in multi-threaded driver.
Date Sun, 04 Oct 2015 16:38:53 GMT
You are absolutely correct, I apologize.

My understanding was that you are sharing the machine across many jobs. That was the context
in which I was making that comment.

-adrian

Sent from my iPhone

On 03 Oct 2015, at 07:03, Philip Weaver <philip.weaver@gmail.com> wrote:

You can't really say 8 cores is not much horsepower when you have no idea what my use case
is. That's silly.

On Fri, Sep 18, 2015 at 10:33 PM, Adrian Tanase <atanase@adobe.com> wrote:
Forgot to mention that you could also restrict the parallelism to 4, essentially using only
4 cores at any given time, however if your job is complex, a stage might be broken into more
than 1 task...

Sent from my iPhone

On 19 Sep 2015, at 08:30, Adrian Tanase <atanase@adobe.com> wrote:

Reading through the docs it seems that with a combination of FAIR scheduler and maybe pools
you can get pretty far.
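
A minimal sketch of what such a pool definition could look like, assuming the application runs with spark.scheduler.mode=FAIR and points spark.scheduler.allocation.file at the generated file. The pool name "requests" and the helper build_allocations are hypothetical, chosen only for illustration. Note that minShare is a guaranteed floor (in cores) for a pool, not a cap, so on its own it does not hold a job to 4 cores.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Build the text of a fair-scheduler allocation file.

    `pools` maps a pool name to (schedulingMode, weight, minShare).
    The pool names are hypothetical; the element names follow the
    pool format described in the Spark job scheduling docs.
    """
    root = ET.Element("allocations")
    for name, (mode, weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = mode
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# One pool that is guaranteed at least 4 cores when it has work.
xml_text = build_allocations({"requests": ("FAIR", 1, 4)})
print(xml_text)
```

Each request-handling thread would then select the pool before starting its job, for example by setting the local property spark.scheduler.pool to "requests" on the SparkContext.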

However the smallest unit of scheduled work is the task so probably you need to think about
the parallelism of each transformation.

I'm guessing that by increasing the level of parallelism you get many smaller tasks that the
scheduler can then run across the many jobs you might have - as opposed to fewer, longer tasks...
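
To make that guess concrete, here is a toy model, not Spark code: it assumes 8 cores, no preemption of running tasks, and a simplified "fair" rule that hands each free core to the job with the fewest running tasks. The function name and all of the numbers are made up for illustration.

```python
def first_start_of_second_job(cores, a_tasks, a_task_len,
                              b_arrival, b_tasks, b_task_len):
    """Return the time step at which job B's first task starts.

    Job A is submitted at t=0 and job B at t=b_arrival. Running tasks
    are never preempted; free cores go to the job with pending tasks
    that currently has the fewest tasks running.
    """
    pending = {"A": a_tasks, "B": 0}
    running = []  # (finish_time, job) for every task in flight
    b_first_start = None
    t = 0
    while b_first_start is None:
        if t == b_arrival:
            pending["B"] = b_tasks
        # Drop tasks that have finished by time t.
        running = [(f, j) for f, j in running if f > t]
        free = cores - len(running)
        while free > 0 and any(pending.values()):
            candidates = [j for j in pending if pending[j] > 0]
            job = min(candidates,
                      key=lambda j: sum(1 for _, r in running if r == j))
            length = a_task_len if job == "A" else b_task_len
            running.append((t + length, job))
            pending[job] -= 1
            free -= 1
            if job == "B" and b_first_start is None:
                b_first_start = t
        t += 1
    return b_first_start


# Same total work for job A (80 core-steps), split two different ways.
few_long = first_start_of_second_job(8, 8, 10, 1, 4, 1)
many_short = first_start_of_second_job(8, 80, 1, 1, 4, 1)
print(few_long, many_short)
```

In this model, 8 long tasks occupy every core until they finish, so the second job waits for all of them; with 80 short tasks, cores free up at every step and the second job can start as soon as it arrives.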

Lastly, 8 cores is not that much horsepower :)
You may consider running with beefier machines or a larger cluster, to get at least tens of
cores.

Hope this helps,
-adrian

Sent from my iPhone

On 18 Sep 2015, at 18:37, Philip Weaver <philip.weaver@gmail.com> wrote:

Here's a specific example of what I want to do. My Spark application is running with total-executor-cores=8.
A request comes in, the application spawns a thread to handle that request, and the thread
starts a job. When that first job is scheduled, it should take only 4 cores, not all 8 of
the cores that are available to the driver.

Is there any way to accomplish this? This is on mesos.

In order to support the use cases described in https://spark.apache.org/docs/latest/job-scheduling.html,
where a spark application runs for a long time and handles requests from multiple users, I
believe what I'm asking about is a very important feature. One of the goals is to get lower
latency for each request, but if the first request takes all resources and we can't guarantee
any free resources for the second request, then that defeats the purpose. Does that make sense?

Thanks in advance for any advice you can provide!

- Philip

On Sat, Sep 12, 2015 at 10:40 PM, Philip Weaver <philip.weaver@gmail.com> wrote:
I'm playing around with dynamic allocation in spark-1.5.0, with the FAIR scheduler, so I can
define a long-running application capable of executing multiple simultaneous spark jobs.

The kind of jobs that I'm running do not benefit from more than 4 cores, but I want my application
to be able to take several times that in order to run multiple jobs at the same time.

I suppose my question is more basic: How can I limit the number of cores used to load an RDD
or DataFrame? I can immediately repartition or coalesce my RDD or DataFrame to 4 partitions
after I load it, but that doesn't stop Spark from using more cores to load it.

Does what I am trying to accomplish make sense, and is there any way to do it?

- Philip



