spark-user mailing list archives

From ArtemisDev <arte...@dtechspace.com>
Subject Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4
Date Mon, 29 Jun 2020 16:17:51 GMT
Could you please share your input files instead of the output files on that 
ticket? I am not sure whether you are following the specific file format 
requirement for JSON in Spark. The following is a snippet from the 
Spark online docs:

Note that the file that is offered as a json file is not a typical JSON 
file. Each line must contain a separate, self-contained valid JSON 
object. For more information, please see JSON Lines text format, also 
called newline-delimited JSON <http://jsonlines.org/>.

For a regular multi-line JSON file, set the multiLine option to true.
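
For illustration, here is a minimal spark-shell sketch of the two read modes 
(it reuses the /data/20200528 path from the timing example further down in 
this thread):

    // Default mode: every line must be one self-contained JSON object (JSON Lines).
    val dfLines = spark.read.json("/data/20200528")

    // If each file holds a single JSON document that spans multiple lines,
    // the multiLine option has to be enabled explicitly.
    val dfMulti = spark.read.option("multiLine", "true").json("/data/20200528")

If the files are in fact multi-line JSON, the default reader does not 
necessarily fail; it typically routes malformed rows into a _corrupt_record 
column, so the read mode has to match the data.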

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:
> Done. https://issues.apache.org/jira/browse/SPARK-32130
>
>
>
> On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.gekk@databricks.com> wrote:
>
>     Hello Sanjeev,
>
>     It is hard to troubleshoot the issue without input files. Could
>     you open a JIRA ticket at
>     https://issues.apache.org/jira/projects/SPARK and attach the JSON
>     files there (or samples or code which generates JSON files)?
>
>     Maxim Gekk
>
>     Software Engineer
>
>     Databricks, Inc.
>
>
>
>     On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <sanjeev.mishra@gmail.com> wrote:
>
>         It has read everything. As you can see, the timing of count()
>         is still smaller in Spark 2.4.
>
>         Spark 2.4
>
>         scala> spark.time(spark.read.json("/data/20200528"))
>         Time taken: 19691 ms
>         res61: org.apache.spark.sql.DataFrame = [created: bigint, id:
>         string ... 5 more fields]
>
>         scala> spark.time(res61.count())
>         Time taken: 7113 ms
>         res64: Long = 2605349
>
>         Spark 3.0
>         scala> spark.time(spark.read.json("/data/20200528"))
>         20/06/29 08:06:53 WARN package: Truncated the string
>         representation of a plan since it was too large. This behavior
>         can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
>         Time taken: 849652 ms
>         res0: org.apache.spark.sql.DataFrame = [created: bigint, id:
>         string ... 5 more fields]
>
>         scala> spark.time(res0.count())
>         Time taken: 8201 ms
>         res2: Long = 2605349
>
>
>
>
>         On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <artemis@dtechspace.com> wrote:
>
>             Could you share your code? Are you sure your Spark 2.4
>             cluster had indeed read anything? It looks like the Input
>             Size field is empty under 2.4.
>
>             -- ND
>
>             On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>
>>             I have a large number of JSON files that Spark 2.4 can read
>>             in 36 seconds, but Spark 3.0 takes almost 33 minutes to read
>>             the same data. On closer analysis, it looks like Spark 3.0 is
>>             choosing a different DAG than Spark 2.4. Does anyone have
>>             any idea what is going on? Is there a configuration
>>             problem with Spark 3.0?
>>
>>             Here are the details:
>>
>>             *Spark 2.4*
>>
>>
>>                     Summary Metrics for 2203 Completed Tasks
>>
>>             Metric   | Min    | 25th percentile | Median | 75th percentile | Max
>>             Duration | 0.0 ms | 0.0 ms          | 0.0 ms | 1.0 ms          | 62.0 ms
>>             GC Time  | 0.0 ms | 0.0 ms          | 0.0 ms | 0.0 ms          | 11.0 ms
>>
>>                     Aggregated Metrics by Executor
>>
>>             Executor ID | Address        | Task Time | Total Tasks | Failed Tasks | Killed Tasks | Succeeded Tasks | Blacklisted
>>             driver      | 10.0.0.8:49159 | 36 s      | 2203        | 0            | 0            | 2203            | false
>>
>>
>>
>>             *Spark 3.0*
>>
>>
>>                     Summary Metrics for 8 Completed Tasks
>>
>>             Metric               | Min              | 25th percentile  | Median           | 75th percentile  | Max
>>             Duration             | 3.8 min          | 4.0 min          | 4.1 min          | 4.4 min          | 5.0 min
>>             GC Time              | 3 s              | 3 s              | 3 s              | 4 s              | 4 s
>>             Input Size / Records | 15.6 MiB / 51028 | 16.2 MiB / 53303 | 16.8 MiB / 55259 | 17.8 MiB / 58148 | 20.2 MiB / 71624
>>
>>                     Aggregated Metrics by Executor
>>
>>             Executor ID | Address        | Task Time | Total Tasks | Failed Tasks | Killed Tasks | Succeeded Tasks | Blacklisted | Input Size / Records
>>             driver      | 10.0.0.8:50224 | 33 min    | 8           | 0            | 0            | 8               | false       | 136.1 MiB / 451999
>>
>>
>>
>>             The DAG is also different.
>>             Spark 2.4 DAG
>>
>>             Screenshot 2020-06-27 16.30.26.png
>>
>>             Spark 3.0 DAG
>>
>>             Screenshot 2020-06-27 16.32.32.png
>>
>>
