spark-dev mailing list archives

From zhangliyun <>
Subject How to estimate the rdd size before the rdd result is written to disk
Date Fri, 20 Dec 2019 05:26:12 GMT
Hi all,
I would like to ask how to estimate the size of an RDD (in bytes) before it is
saved to disk. The job takes a long time when the output is very large and the
number of output partitions is small.

These are the steps I have come up with to solve this problem:

 1. Sample 1% (fraction 0.01) of the original data.

 2. Compute the record count of the sample.

 3. If the sample count is > 0, cache the sample and compute its size in bytes.

 4. Compute the total record count of the original RDD.

 5. Estimate the RDD size as ${total count} * ${sample data size} / ${sample count}.
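
For illustration, the five steps can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code from the original job: an in-memory list stands in for the RDD, `pickle` stands in for whatever serializer the real output would use, and the function name `estimate_total_bytes` is hypothetical. In PySpark, steps 1, 2 and 4 would roughly correspond to `rdd.sample(False, 0.01)` and `rdd.count()`.

```python
import pickle
import random


def estimate_total_bytes(records, fraction=0.01, seed=42):
    """Estimate the serialized size of `records` (a list) from a random sample.

    Returns None when the sample is empty, because no estimate is possible.
    """
    rng = random.Random(seed)

    # Step 1: keep each record independently with probability `fraction`.
    sample = [r for r in records if rng.random() < fraction]

    # Step 2: count the sampled records.
    sample_count = len(sample)
    if sample_count == 0:
        return None

    # Step 3: measure the sample size in bytes.
    # Assumption: pickle is only a stand-in for the real output format.
    sample_bytes = sum(len(pickle.dumps(r)) for r in sample)

    # Step 4: count all records.
    total_count = len(records)

    # Step 5: scale the average sampled record size up to the full data set.
    return total_count * sample_bytes / sample_count


if __name__ == "__main__":
    data = ["row-%06d" % i for i in range(100_000)]
    print(estimate_total_bytes(data))
```

The estimate is only as good as the sample: if record sizes are heavily skewed, a 1% sample can miss the large records, and the serialized size measured on the sample may differ from the size of the final on-disk format (for example when compression is applied).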

The code is here.  

My questions:
1. Can I use the approach above to solve this problem? If not, what is wrong with it?
2. Is there an existing solution (an existing API in Spark) for this problem?

Best Regards
Kelly Zhang