spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4
Date Tue, 30 Jun 2020 12:37:37 GMT
Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA
tickets is it correct?


Regards,
Gourav Sengupta

On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <sanjeev.mishra@gmail.com>
wrote:

> There is not much code, I am just using spark-shell and reading the data
> like so
>
> spark.time(spark.read.json("/data/small-anon/"))
>
> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <gourav.sengupta@gmail.com>
> wrote:
>
> Hi Sanjeev,
>
> can you share the exact code that you are using to read the JSON files?
> Currently I am getting only 11 records from the tar file that you have
> attached with JIRA.
>
> Regards,
> Gourav Sengupta
>
> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <artemis@dtechspace.com> wrote:
>
>> According to the spec, in addition to the line breaks, you should also
>> put the nested object values in arrays instead of dictionaries.  You may
>> want to give a try and see if this would give you a better performance.
>>
>> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms
>> 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.
>>
>> -- ND
>> On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
>>
>> The tar file that I have attached has bunch of json.zip files and this is
>> the file that is being processed. Each line is self contained JSON as shown
>> below
>>
>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz |
>>>> head -3
>>>
>>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>>>> HNC IGD","Annex F
>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>>>> Arris FastPath Speed
>>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","
>>>> Arris.NVG4xx.Missing.CA <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>> HNC IGD
>>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>>>> Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First
>>>> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
>>>
>>> {"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
>>>> HNC IGD","Annex F
>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports
>>>> Arris FastPath Speed
>>>> Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","
>>>> Arris.NVG4xx.Missing.CA <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>> HNC IGD
>>>> EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total
>>>> Control","GPON_100M_100M","Self-Service
>>>> Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
>>>
>>> {"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
>>>> HNC IGD","Annex F
>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports
>>>> Arris FastPath Speed
>>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","
>>>> Arris.NVG4xx.Missing.CA <http://arris.nvg4xx.missing.ca/>","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>> HNC IGD
>>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total
>>>> Control","HSI","Self-Service
>>>> Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx -
>>>> First Contact","GPON_200M_200M","TCW
>>>> Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}
>>>
>>>
>> On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <artemis@dtechspace.com>
>> wrote:
>>
>>> Could you please share your input file instead of output files on that
>>> ticket?   Not sure if you were following the specific file format
>>> requirement for JSON in Spark.  The following is a snippet from the Spark
>>> online doc:
>>>
>>> Note that the file that is offered as *a json file* is not a typical
>>> JSON file. Each line must contain a separate, self-contained valid JSON
>>> object. For more information, please see JSON Lines text format, also
>>> called newline-delimited JSON <http://jsonlines.org/>.
>>>
>>> For a regular multi-line JSON file, set the multiLine option to true.
>>>
>>> -- ND
>>> On 6/29/20 11:55 AM, Sanjeev Mishra wrote:
>>>
>>> Done. https://issues.apache.org/jira/browse/SPARK-32130
>>>
>>>
>>>
>>> On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.gekk@databricks.com>
>>> wrote:
>>>
>>>> Hello Sanjeev,
>>>>
>>>> It is hard to troubleshoot the issue without input files. Could you
>>>> open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and
>>>> attach the JSON files there (or samples or code which generates JSON
>>>> files)?
>>>>
>>>> Maxim Gekk
>>>> Software Engineer
>>>> Databricks, Inc.
>>>>
>>>>
>>>> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <
>>>> sanjeev.mishra@gmail.com> wrote:
>>>>
>>>>> It has read everything. As you notice the timing of count is still
>>>>> smaller in Spark 2.4
>>>>>
>>>>> Spark 2.4
>>>>>
>>>>> scala> spark.time(spark.read.json("/data/20200528"))
>>>>> Time taken: 19691 ms
>>>>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string
>>>>> ... 5 more fields]
>>>>>
>>>>> scala> spark.time(res61.count())
>>>>> Time taken: 7113 ms
>>>>> res64: Long = 2605349
>>>>>
>>>>> Spark 3.0
>>>>> scala> spark.time(spark.read.json("/data/20200528"))
>>>>> 20/06/29 08:06:53 WARN package: Truncated the string representation of
>>>>> a plan since it was too large. This behavior can be adjusted by setting
>>>>> 'spark.sql.debug.maxToStringFields'.
>>>>> Time taken: 849652 ms
>>>>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string
>>>>> ... 5 more fields]
>>>>>
>>>>> scala> spark.time(res0.count())
>>>>> Time taken: 8201 ms
>>>>> res2: Long = 2605349
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <artemis@dtechspace.com>
>>>>> wrote:
>>>>>
>>>>>> Could you share your code?  Are you sure you Spark 2.4 cluster had
>>>>>> indeed read anything?  Looks like the Input size field is empty under
2.4.
>>>>>>
>>>>>> -- ND
>>>>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>>>>>
>>>>>>
>>>>>> I have large amount of json files that Spark can read in 36 seconds
>>>>>> but Spark 3.0 takes almost 33 minutes to read the same. On closer
analysis,
>>>>>> looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does
anyone
>>>>>> have any idea what is going on? Is there any configuration problem
with
>>>>>> Spark 3.0.
>>>>>>
>>>>>> Here are the details:
>>>>>>
>>>>>> *Spark 2.4*
>>>>>>
>>>>>> Summary Metrics for 2203 Completed Tasks
>>>>>> <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>
>>>>>> Metric Min 25th percentile Median 75th percentile Max
>>>>>> Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
>>>>>> GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
>>>>>> Showing 1 to 2 of 2 entries
>>>>>>   Aggregated Metrics by Executor
>>>>>> Show  20 40 60 100 All  entries
>>>>>> Search:
>>>>>> Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed
>>>>>> Tasks Succeeded Tasks Blacklisted
>>>>>> driver
>>>>>> 10.0.0.8:49159 36 s 2203 0 0 2203 false
>>>>>>
>>>>>>
>>>>>> *Spark 3.0*
>>>>>>
>>>>>> Summary Metrics for 8 Completed Tasks
>>>>>> <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>
>>>>>> Metric Min 25th percentile Median 75th percentile Max
>>>>>> Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
>>>>>> GC Time 3 s 3 s 3 s 4 s 4 s
>>>>>> Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB /
>>>>>> 55259 17.8 MiB / 58148 20.2 MiB / 71624
>>>>>> Showing 1 to 3 of 3 entries
>>>>>>   Aggregated Metrics by Executor
>>>>>> Show  20 40 60 100 All  entries
>>>>>> Search:
>>>>>> Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed
>>>>>> Tasks Succeeded Tasks Blacklisted Input Size / Records
>>>>>> driver
>>>>>> 10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999
>>>>>>
>>>>>>
>>>>>> The DAG is also different
>>>>>> Spark 2.0 DAG
>>>>>>
>>>>>> <Screenshot 2020-06-27 16.30.26.png>
>>>>>>
>>>>>> Spark 3.0 DAG
>>>>>>
>>>>>> <Screenshot 2020-06-27 16.32.32.png>
>>>>>>
>>>>>>
>>>>>>
>

Mime
View raw message