spark-user mailing list archives

From Sriram Bhamidipati <sriram...@gmail.com>
Subject Re: How to estimate the rdd size before the rdd result is written to disk
Date Fri, 20 Dec 2019 06:25:58 GMT
Hello Experts
I am trying to maximise resource utilisation on my 3-node Spark cluster
(2 data nodes and 1 driver) so that the job finishes quickest, and to
build a benchmark so I can recommend an optimal POD for the job:
128 GB x 16 cores per node.
I have standalone Spark 2.4.0 running. HTOP shows only half of the
memory in use, so what alternatives can I try? CPU is always at 100%
for the allocated resources. Can I reduce per-executor memory to 32 GB
and increase the number of executors?
I have the following properties:

spark.driver.maxResultSize 64g
spark.driver.memory 100g
spark.driver.port 33631
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60s
spark.executor.cores 8
spark.executor.id driver
spark.executor.instances 4
spark.executor.memory 64g
spark.files file://dist/xxxx-0.0.1-py3.7.egg
spark.locality.wait 10s

100
spark.shuffle.service.enabled true
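A quick back-of-the-envelope check of what each executor layout claims from the cluster (a sketch only, assuming the 2 data nodes at 128 GB x 16 cores mentioned above, and ignoring driver and OS overhead):

```python
# Rough cluster-utilisation check for two executor layouts.
# Node specs are taken from the message; nothing here is measured.

NODES = 2           # data nodes
NODE_MEM_GB = 128
NODE_CORES = 16

def utilisation(executors, mem_gb_each, cores_each):
    """Fraction of total cluster memory and cores a layout claims."""
    total_mem = NODES * NODE_MEM_GB
    total_cores = NODES * NODE_CORES
    return (executors * mem_gb_each / total_mem,
            executors * cores_each / total_cores)

# Current layout: 4 executors x 64 GB x 8 cores
print(utilisation(4, 64, 8))   # claims all memory and all cores
# Candidate layout: 8 executors x 32 GB x 4 cores
print(utilisation(8, 32, 4))   # same totals, smaller heaps per JVM
```

Both layouts claim the same totals, so the difference is heap size per JVM rather than headline allocation; HTOP showing only half the memory in use can simply mean the job's working set is smaller than the allocated heaps, since the JVM grows its resident memory lazily.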

On Fri, Dec 20, 2019 at 10:56 AM zhangliyun <kellyzly@126.com> wrote:

> Hi all,
>
> I want to ask a question about how to estimate the RDD size (in
> bytes) before it is saved to disk, because the job takes a long time
> when the output is very large and the output partition number is small.
>
>
> The following steps are my approach to this problem:
>
>  1. sample a 0.01 fraction of the original data
>
>  2. compute the sample data count
>
>  3. if the sample data count > 0, cache the sample data and compute
> the sample data size
>
>  4. compute the original RDD's total count
>
>  5. estimate the RDD size as ${total count} * ${sample data size} /
> ${sample count}
>
> The code is here
> <https://github.com/kellyzly/sparkcode/blob/master/EstimateDataSetSize.scala#L24>
> .
>
> My questions:
> 1. Can I use the above approach to solve the problem? If not, where
> is it wrong?
> 2. Is there an existing solution (an existing API in Spark) for this
> problem?
>
>
>
> Best Regards
> Kelly Zhang
>
>
>
>
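The extrapolation arithmetic in the quoted steps can be sketched in plain Python (no Spark APIs; the record generator, pickle-based size measure, and function name below are all illustrative stand-ins):

```python
# Sketch of the quoted 5-step estimation: sample a fraction, measure
# the sample's serialized size, then scale by total_count / sample_count.
import pickle
import random

def estimate_size_bytes(records, fraction=0.01, seed=42):
    random.seed(seed)
    sample = [r for r in records if random.random() < fraction]  # step 1
    sample_count = len(sample)                                   # step 2
    if sample_count == 0:                                        # step 3
        return 0
    sample_bytes = sum(len(pickle.dumps(r)) for r in sample)
    total_count = len(records)                                   # step 4
    # step 5: linear extrapolation from the sample to the full data set
    return total_count * sample_bytes // sample_count

records = [("key%d" % i, i * i) for i in range(100_000)]
print(estimate_size_bytes(records))
```

In Spark itself the sample would come from `rdd.sample(withReplacement = false, 0.01)`, and note that `org.apache.spark.util.SizeEstimator.estimate` measures in-memory object size, which can differ substantially from the serialized on-disk size the extrapolation above targets.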


-- 
-Sriram
