spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (Jira)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
Date Wed, 01 Jul 2020 11:39:02 GMT

     [ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Apache Spark reassigned SPARK-32130:
------------------------------------

    Assignee: Apache Spark

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --------------------------------------------------------------------------
>
>                 Key: SPARK-32130
>                 URL: https://issues.apache.org/jira/browse/SPARK-32130
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.0
>         Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, sanjeevs-MacBook-Pro-2.local
resolves to a loopback address: 127.0.0.1; using 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting
port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>  ____ __
>  / __/__ ___ _____/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>            Reporter: Sanjeev Mishra
>            Assignee: Apache Spark
>            Priority: Critical
>         Attachments: SPARK 32130 - replication and findings.ipynb, small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files is unacceptable.
Following is the performance numbers when compared to Spark 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff0000}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#0000ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since
it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff0000}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#0000ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching a sample data (please delete is once you are able to reproduce the issue)
that is much smaller than the actual size but the performance comparison can still be verified.
> The sample tar contains bunch of json.gz files, each line of the file is self contained
json doc as shown below
> To reproduce the issue please untar the attachment - it will have multiple .json.gz files
whose contents will look similar to following
>  
> {quote}{color:#0000ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS
HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}{color}
> {quote}
> {quote}{color:#741b47}{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports
Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total
Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}{color}
> {quote}
> {quote}{color:#0000ff}{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports
Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total
Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW
- NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}{color}
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message