spark-user mailing list archives

From ArtemisDev <arte...@dtechspace.com>
Subject Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4
Date Mon, 29 Jun 2020 14:43:56 GMT
Could you share your code?  Are you sure your Spark 2.4 cluster actually 
read anything?  It looks like the Input Size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>
> I have a large number of JSON files that Spark 2.4 can read in 36 
> seconds, but Spark 3.0 takes almost 33 minutes to read the same files. 
> On closer analysis, it looks like Spark 3.0 is choosing a different DAG 
> than Spark 2.4. Does anyone have any idea what is going on? Is there a 
> configuration problem with Spark 3.0?
>
> Here are the details:
>
> *Spark 2.4*
>
>
>         Summary Metrics for 2203 Completed Tasks
>         <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>
>
> Metric    Min     25th percentile  Median  75th percentile  Max
> Duration  0.0 ms  0.0 ms           0.0 ms  1.0 ms           62.0 ms
> GC Time   0.0 ms  0.0 ms           0.0 ms  0.0 ms           11.0 ms
>
>
>         Aggregated Metrics by Executor
>
>
> Executor ID:      driver
> Address:          10.0.0.8:49159
> Task Time:        36 s
> Total Tasks:      2203
> Failed Tasks:     0
> Killed Tasks:     0
> Succeeded Tasks:  2203
> Blacklisted:      false
>
>
>
> *Spark 3.0*
>
>
>         Summary Metrics for 8 Completed Tasks
>         <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>
>
> Metric                Min               25th percentile   Median            75th percentile   Max
> Duration              3.8 min           4.0 min           4.1 min           4.4 min           5.0 min
> GC Time               3 s               3 s               3 s               4 s               4 s
> Input Size / Records  15.6 MiB / 51028  16.2 MiB / 53303  16.8 MiB / 55259  17.8 MiB / 58148  20.2 MiB / 71624
>
>
>         Aggregated Metrics by Executor
>
>
> Executor ID:           driver
> Address:               10.0.0.8:50224
> Task Time:             33 min
> Total Tasks:           8
> Failed Tasks:          0
> Killed Tasks:          0
> Succeeded Tasks:       8
> Blacklisted:           false
> Input Size / Records:  136.1 MiB / 451999
>
>
>
> The DAGs are also different.
> Spark 2.4 DAG
>
> Screenshot 2020-06-27 16.30.26.png
>
> Spark 3.0 DAG
>
> Screenshot 2020-06-27 16.32.32.png
>
>
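For reference, the figures quoted above can be put side by side with a little arithmetic. This is only a sketch: the numbers are copied from the two "Aggregated Metrics by Executor" tables, the variable names are made up for illustration, and it assumes "Task Time" is summed across tasks rather than wall-clock time (8 tasks at a median of 4.1 min do add up to roughly 33 min). It also assumes the two stages did comparable work, which is exactly what the reply questions, since the 2.4 stage reports no input size.

```python
# Figures copied from the quoted Spark UI tables; names are illustrative only.
spark24_task_time_s = 36          # Spark 2.4: 36 s over 2203 tasks
spark30_task_time_s = 33 * 60     # Spark 3.0: 33 min over 8 tasks
spark30_input_mib = 136.1         # Spark 3.0 input size (2.4 reports none)
spark30_records = 451_999         # Spark 3.0 records read

# Ratio of summed task time between the two stages.
slowdown = spark30_task_time_s / spark24_task_time_s

# Spark 3.0 throughput per second of task time (not wall-clock time).
mib_per_task_second = spark30_input_mib / spark30_task_time_s
records_per_task_second = spark30_records / spark30_task_time_s

print(f"task-time ratio 3.0 vs 2.4: {slowdown:.0f}x")
print(f"Spark 3.0: {mib_per_task_second:.3f} MiB per task-second")
print(f"Spark 3.0: {records_per_task_second:.0f} records per task-second")
```

On summed task time the gap works out to about 55x. The 2.4 per-task durations (median 0.0 ms) are so small that the 2.4 stage may not have read the file contents at all, which would make a direct comparison misleading.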
