spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4
Date Mon, 29 Jun 2020 09:25:32 GMT
Hi,

Can you please share the Spark code?
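
Something along these lines would be enough to compare the two versions. This is only a sketch: the helper name `time_json_read`, the app name, and the use of local mode are placeholders of mine, not taken from your job. It reads the files, prints the plan, and times a full count.

```python
import sys
import time


def time_json_read(path):
    """Read JSON files under `path`, print the plan, return (row count, seconds)."""
    # Imported inside the function so the file can be opened without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("json-read-repro")  # placeholder name
        .master("local[*]")          # assumption: local mode, since only the driver ran tasks
        .getOrCreate()
    )
    print("Spark version:", spark.version)

    start = time.perf_counter()
    df = spark.read.json(path)  # no schema given, so Spark infers it from the data
    df.explain(True)            # prints the logical and physical plans
    rows = df.count()           # forces a full read of the input
    elapsed = time.perf_counter() - start

    spark.stop()
    return rows, elapsed


# Run as: python repro.py <json_dir>
if __name__ == "__main__" and len(sys.argv) == 2:
    rows, elapsed = time_json_read(sys.argv[1])
    print(f"{rows} rows in {elapsed:.1f} s")
```

Running the same file against both Spark 2.4 and Spark 3.0 and posting the two `explain` outputs would make the plan difference much easier to see than the screenshots.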



Regards,
Gourav

On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra <sanjeev.mishra@gmail.com>
wrote:

>
> I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but
> Spark 3.0 takes almost 33 minutes to read the same files. On closer analysis,
> it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone
> have any idea what is going on? Is there a configuration problem with
> Spark 3.0?
>
> Here are the details:
>
> *Spark 2.4*
>
> Summary Metrics for 2203 Completed Tasks
> <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>
>
> Metric    Min     25th percentile  Median  75th percentile  Max
> Duration  0.0 ms  0.0 ms           0.0 ms  1.0 ms           62.0 ms
> GC Time   0.0 ms  0.0 ms           0.0 ms  0.0 ms           11.0 ms
>
> Aggregated Metrics by Executor
>
> Executor ID:      driver
> Address:          10.0.0.8:49159
> Task Time:        36 s
> Total Tasks:      2203
> Failed Tasks:     0
> Killed Tasks:     0
> Succeeded Tasks:  2203
> Blacklisted:      false
>
>
> *Spark 3.0*
>
> Summary Metrics for 8 Completed Tasks
> <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>
>
> Metric                Min               25th percentile   Median            75th percentile   Max
> Duration              3.8 min           4.0 min           4.1 min           4.4 min           5.0 min
> GC Time               3 s               3 s               3 s               4 s               4 s
> Input Size / Records  15.6 MiB / 51028  16.2 MiB / 53303  16.8 MiB / 55259  17.8 MiB / 58148  20.2 MiB / 71624
>
> Aggregated Metrics by Executor
>
> Executor ID:           driver
> Address:               10.0.0.8:50224
> Task Time:             33 min
> Total Tasks:           8
> Failed Tasks:          0
> Killed Tasks:          0
> Succeeded Tasks:       8
> Blacklisted:           false
> Input Size / Records:  136.1 MiB / 451999
>
>
> The DAGs are also different.
> Spark 2.4 DAG
>
> [image: Screenshot 2020-06-27 16.30.26.png]
>
> Spark 3.0 DAG
>
> [image: Screenshot 2020-06-27 16.32.32.png]
>
>
>
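
As a rough cross-check of the numbers quoted above, the short calculation below works out the slowdown and the Spark 3.0 read rate from the executor totals. The constant and function names are mine. Note the assumption that "Task Time" is the sum of task durations on the executor rather than wall-clock time, so the ratio compares total task time, not elapsed time. The Spark 2.4 table shows no Input Size / Records column, so no read rate can be derived for it.

```python
# Figures copied from the "Aggregated Metrics by Executor" tables quoted above.
SPARK24_TASK_TIME_S = 36         # "36 s" over 2203 tasks
SPARK30_TASK_TIME_S = 33 * 60    # "33 min" over 8 tasks
SPARK30_INPUT_MIB = 136.1        # total input size reported for Spark 3.0
SPARK30_RECORDS = 451_999        # total records reported for Spark 3.0


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow run's task time is."""
    return slow_seconds / fast_seconds


def rate(amount, seconds):
    """Amount processed per second of task time."""
    return amount / seconds


ratio = slowdown(SPARK30_TASK_TIME_S, SPARK24_TASK_TIME_S)
records_per_s = rate(SPARK30_RECORDS, SPARK30_TASK_TIME_S)
mib_per_s = rate(SPARK30_INPUT_MIB, SPARK30_TASK_TIME_S)

print(f"task-time slowdown: {ratio:.0f}x")
print(f"Spark 3.0 read rate: {records_per_s:.0f} records/s, {mib_per_s:.3f} MiB/s")
```

On these totals the task-time ratio is 55, well short of the "almost 1000 times" in the subject line, although the per-task durations differ far more (a median of 0.0 ms against 4.1 min).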
