spark-user mailing list archives

From Shahbaz <shahzadh...@gmail.com>
Subject Re: How to address seemingly low core utilization on a spark workload?
Date Thu, 15 Nov 2018 17:49:05 GMT
30k SQL shuffle partitions is extremely high. Each core works on one
partition at a time (1 to 1). The default value of
spark.sql.shuffle.partitions is 200; set it to 300 or leave it at the
default, see which one gives the best performance, and then check how the
cores are being used.
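
To put rough numbers on this, here is a minimal Python sketch. It assumes about 32 GB of input (the thread mentions at most 16 Parquet files of "a couple of gigs each"; 2 GB per file is my assumption) and the 300 cores from the original question. The function names are illustrative and not part of Spark, and on-disk Parquet size is only a loose proxy for shuffle volume.

```python
import math

def task_waves(num_partitions: int, num_cores: int) -> int:
    """Scheduling rounds needed to run one task per partition."""
    return math.ceil(num_partitions / num_cores)

def avg_partition_mb(total_gb: float, num_partitions: int) -> float:
    """Average partition size in MB, assuming evenly spread data."""
    return total_gb * 1024 / num_partitions

TOTAL_GB = 16 * 2   # assumption: 16 files of roughly 2 GB each
CORES = 300         # cluster size from the original question

# Compare the default (200), the suggested value (300) and the current 30,000.
for partitions in (200, 300, 30000):
    print(partitions,
          task_waves(partitions, CORES),
          round(avg_partition_mb(TOTAL_GB, partitions), 1))
```

Under these assumptions, 30,000 partitions means about 100 waves of tasks of roughly 1 MB each, so per-task overhead is likely to dominate. With the default of 200, a shuffle stage can keep at most 200 of the 300 cores busy.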

Regards,
Shahbaz

On Thu, Nov 15, 2018 at 10:58 PM Vitaliy Pisarev <
vitaliy.pisarev@biocatch.com> wrote:

> Oh, regarding shuffle.partitions being 30k, I don't know why. I inherited
> the workload from an engineer who is no longer around and am trying to
> make sense of things in general.
>
> On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev <
> vitaliy.pisarev@biocatch.com> wrote:
>
>> The quest is dual:
>>
>>
>>    - Increase utilisation, because cores cost money and I want to make
>>    sure that I fully utilise what I pay for. This is very blunt of course,
>>    because there is always I/O and at least some degree of skew. Bottom line:
>>    do the same thing in the same time but with fewer (but better
>>    utilised) resources.
>>    - Reduce runtime by increasing parallelism.
>>
>> While not the same, I am looking at these as two sides of the same coin.
>>
>>
>>
>>
>>
>> On Thu, Nov 15, 2018 at 6:58 PM Thakrar, Jayesh <
>> jthakrar@conversantmedia.com> wrote:
>>
>>> For that little data, I find spark.sql.shuffle.partitions = 30000 to be
>>> very high.
>>>
>>> Any reason for that high value?
>>>
>>>
>>>
>>> Do you have a baseline observation with the default value?
>>>
>>>
>>>
>>> Also, setting the job group and job description through the API and
>>> observing them in the UI will help you identify which code is running
>>> when utilization is low.
>>>
>>>
>>>
>>> Finally, high utilization does not equate to high efficiency.
>>>
>>> It's very likely that for your workload you only need 16-128
>>> executors.
>>>
>>> I would suggest getting the partition count for the various
>>> datasets/dataframes/rdds in your code by using
>>>
>>>
>>>
>>> dataset.rdd.getNumPartitions
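
A minimal sketch of that partition-count check in PySpark, where the equivalent call is `df.rdd.getNumPartitions()`. The helper name `partition_counts` and the labels passed to it are hypothetical; the function only assumes each value exposes `.rdd.getNumPartitions()`, as a PySpark DataFrame does.

```python
def partition_counts(named_frames):
    """Return {label: partition count} for a dict of label -> DataFrame.

    Each value only needs to expose `.rdd.getNumPartitions()`,
    which is what a PySpark DataFrame provides.
    """
    return {name: df.rdd.getNumPartitions()
            for name, df in named_frames.items()}
```

Calling it at a few points in the job, for example `partition_counts({"raw": raw_df, "grouped": grouped_df})` with your own DataFrames, shows where the partition count changes after filters, joins and aggregations.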
>>>
>>>
>>>
>>> I would also suggest running a number of tests with different numbers
>>> of executors.
>>>
>>>
>>>
>>> But coming back to the objective behind your quest: are you trying to
>>> maximize utilization in the hope that high parallelism will reduce
>>> your total runtime?
>>>
>>>
>>>
>>>
>>>
>>> *From: *Vitaliy Pisarev <vitaliy.pisarev@biocatch.com>
>>> *Date: *Thursday, November 15, 2018 at 10:07 AM
>>> *To: *<jthakrar@conversantmedia.com>
>>> *Cc: *user <user@spark.apache.org>, David Markovitz <
>>> Dudu.Markovitz@microsoft.com>
>>> *Subject: *Re: How to address seemingly low core utilization on a spark
>>> workload?
>>>
>>>
>>>
>>> I am working with Parquet files, and reading the metadata there is quite
>>> fast as there are at most 16 files (a couple of gigs each).
>>>
>>>
>>>
>>> I find it very hard to answer the question "how many partitions do you
>>> have?": many Spark operations do not preserve partitioning, and I have a
>>> lot of filtering and grouping going on.
>>>
>>> What I *can* say is that I set spark.sql.shuffle.partitions to
>>> 30,000.
>>>
>>>
>>>
>>> I am not worried that there are not enough partitions to keep the cores
>>> working. Having said that, I do see that high utilisation correlates
>>> heavily with shuffle read/write, whereas low utilisation correlates with
>>> no shuffling.
>>>
>>> This leads me to the conclusion that compared to the amount of
>>> shuffling, the cluster is doing very little work.
>>>
>>>
>>>
>>> The question is what I can do about it.
>>>
>>>
>>>
>>> On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <
>>> jthakrar@conversantmedia.com> wrote:
>>>
>>> Can you shed more light on what kind of processing you are doing?
>>>
>>>
>>>
>>> One common pattern I have seen where active core/executor utilization
>>> drops to zero is while reading ORC data, when the driver seems (I think)
>>> to be doing schema validation.
>>>
>>> In my case I would have hundreds of thousands of ORC data files and
>>> there is dead silence for about 1-2 hours.
>>>
>>> I have tried providing a schema and disabling schema validation while
>>> reading the ORC data, but that does not seem to help (Spark 2.2.1).
>>>
>>>
>>>
>>> And as you know, in most cases, there is a linear relationship between
>>> number of partitions in your data and the concurrently active executors.
>>>
>>>
>>>
>>> Another thing I would suggest is to use the following two API methods;
>>> they annotate the Spark stages and jobs in the Spark UI with what is
>>> being executed.
>>>
>>> SparkContext.setJobGroup(….)
>>>
>>> SparkContext.setJobDescription(….)
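
A minimal PySpark sketch of this kind of annotation. `run_phase`, the group id and the step descriptions are hypothetical names chosen for illustration; `sc` is assumed to be a `pyspark.SparkContext`, and it is worth checking that your PySpark version provides `setJobDescription`.

```python
def run_phase(sc, group_id, steps):
    """Run labelled steps so each shows up by name in the Spark UI.

    sc:       expected to be a pyspark SparkContext
    group_id: label shared by every job in this phase
    steps:    list of (description, zero-argument callable) pairs,
              where each callable triggers one or more Spark jobs
    """
    # Tag all jobs started from here on with a common group id.
    sc.setJobGroup(group_id, "phase: " + group_id)
    results = []
    for description, action in steps:
        # Replace the description for the jobs triggered by this step.
        sc.setJobDescription(description)
        results.append(action())
    return results
```

For example, `run_phase(spark.sparkContext, "aggregation", [("load parquet", lambda: df.count())])`, with your own session and DataFrame, makes it easier to match an idle stretch in the active-task plot to the step that was running at the time.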
>>>
>>>
>>>
>>> *From: *Vitaliy Pisarev <vitaliy.pisarev@biocatch.com>
>>> *Date: *Thursday, November 15, 2018 at 8:51 AM
>>> *To: *user <user@spark.apache.org>
>>> *Cc: *David Markovitz <Dudu.Markovitz@microsoft.com>
>>> *Subject: *How to address seemingly low core utilization on a spark
>>> workload?
>>>
>>>
>>>
>>> I have a workload that runs on a cluster of 300 cores.
>>>
>>> Below is a plot of the number of active tasks over time during the
>>> execution of this workload:
>>>
>>>
>>>
>>> [image: image.png]
>>>
>>>
>>>
>>> What I deduce is that there are substantial intervals where the cores
>>> are heavily under-utilised.
>>>
>>>
>>>
>>> What actions can I take to:
>>>
>>>    - Increase the efficiency (== core utilisation) of the cluster?
>>>    - Understand the root causes behind the drops in core utilisation?
>>>
>>>
