spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thakrar, Jayesh" <>
Subject Re: How to address seemingly low core utilization on a spark workload?
Date Thu, 15 Nov 2018 15:29:55 GMT
Can you shed more light on what kind of processing you are doing?

One common pattern that I have seen for active core/executor utilization dropping to zero
is while reading ORC data and the driver seems (I think) to be doing schema validation.
In my case I would have hundreds of thousands of ORC data files and there is dead silence
for about 1-2 hours.
I have tried providing a schema and disabling schema validation while reading the ORC data,
but that does not seem to help (Spark 2.2.1).

And as you know, in most cases, there is a linear relationship between number of partitions
in your data and the concurrently active executors.

Another thing I would suggest is use the following two API calls/method – they will annotate
the spark stages and jobs with what is being executed in the Spark UI.

From: Vitaliy Pisarev <>
Date: Thursday, November 15, 2018 at 8:51 AM
To: user <>
Cc: David Markovitz <>
Subject: How to address seemingly low core utilization on a spark workload?

I have a workload that runs on a cluster of 300 cores.
Below is a plot of the amount of active tasks over time during the execution of this workload:


What I deduce is that there are substantial intervals where the cores are heavily under-utilised.

What actions can I take to:

  *   Increase the efficiency (== core utilisation) of the cluster?
  *   Understand the root causes behind the drops in core utilisation?
View raw message