spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4
Date Tue, 30 Jun 2020 18:18:28 GMT
Hi,

I think that the notebook clearly demonstrates that setting the
inferTimestamp option to False does not really help.

Is it really impossible for you to show how your own data can be loaded? It
should be simple, just open the notebook and see why the exact code you
have given does not work, and shows only 11 records.


Regards,
Gourav Sengupta

On Tue, Jun 30, 2020 at 4:15 PM Sanjeev Mishra <sanjeev.mishra@gmail.com>
wrote:

> Hi Gourav,
>
> Please check the comments of the ticket, looks like the performance
> degradation is attributed to inferTimestamp option that is true by default
> (I have no idea why) in Spark 3.0. This forces Spark to scan entire text
> and so the poor performance.
>
> Regards
> Sanjeev
>
> On Jun 30, 2020, at 8:12 AM, Gourav Sengupta <gourav.sengupta@gmail.com>
> wrote:
>
> Hi, Sanjeev,
>
> I think that I did precisely that, can you please download my ipython
> notebook and have a look, and let me know where I am going wrong. its
> attached with the JIRA ticket.
>
>
> Regards,
> Gourav Sengupta
>
> On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra <sanjeev.mishra@gmail.com>
> wrote:
>
>> There are total 11 files as part of tar. You will have to untar it to get
>> to actual files (.json.gz)
>>
>> No, I am getting
>>
>> Count: 33447
>>
>> spark.time(spark.read.json(“/data/small-anon/"))
>> Time taken: 431 ms
>> res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ...
>> 2 more fields]
>>
>> scala> res73.count()
>> res74: Long = 33447
>>
>> ls -ltr
>> total 7592
>> -rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40
>> part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40
>> part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40
>> part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40
>> part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40
>> part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40
>> part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40
>> part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40
>> part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40
>> part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40
>> part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40
>> part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>> -rw-r--r--  1 sanjeevmishra  staff       0 Jun 29 08:40 _SUCCESS
>>
>> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta <gourav.sengupta@gmail.com>
>> wrote:
>>
>> Hi Sanjeev,
>> that just gives 11 records from the sample that you have loaded to the
>> JIRA tickets is it correct?
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <sanjeev.mishra@gmail.com>
>> wrote:
>>
>>> There is not much code, I am just using spark-shell and reading the data
>>> like so
>>>
>>> spark.time(spark.read.json("/data/small-anon/"))
>>>
>>> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <gourav.sengupta@gmail.com>
>>> wrote:
>>>
>>> Hi Sanjeev,
>>>
>>> can you share the exact code that you are using to read the JSON files?
>>> Currently I am getting only 11 records from the tar file that you have
>>> attached with JIRA.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <artemis@dtechspace.com>
>>> wrote:
>>>
>>>> According to the spec, in addition to the line breaks, you should also
>>>> put the nested object values in arrays instead of dictionaries.  You may
>>>> want to give a try and see if this would give you a better performance.
>>>>
>>>> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms
>>>> 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.
>>>>
>>>> -- ND
>>>> On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
>>>>
>>>> The tar file that I have attached has bunch of json.zip files and this
>>>> is the file that is being processed. Each line is self contained JSON as
>>>> shown below
>>>>
>>>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz |
>>>>>> head -3
>>>>>
>>>>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>>>>>> HNC IGD","Annex F
>>>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>>>>>> Arris FastPath Speed
>>>>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","
>>>>>> Arris.NVG4xx.Missing.CA <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>>>> HNC IGD
>>>>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>>>>>> Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First
>>>>>> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
>>>>>
>>>>> {"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
>>>>>> HNC IGD","Annex F
>>>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports
>>>>>> Arris FastPath Speed
>>>>>> Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","
>>>>>> Arris.NVG4xx.Missing.CA <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>>>> HNC IGD
>>>>>> EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total
>>>>>> Control","GPON_100M_100M","Self-Service
>>>>>> Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
>>>>>
>>>>> {"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
>>>>>> HNC IGD","Annex F
>>>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports
>>>>>> Arris FastPath Speed
>>>>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","
>>>>>> Arris.NVG4xx.Missing.CA <http://arris.nvg4xx.missing.ca/>","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>>>> HNC IGD
>>>>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total
>>>>>> Control","HSI","Self-Service
>>>>>> Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW -
NVG4xx -
>>>>>> First Contact","GPON_200M_200M","TCW
>>>>>> Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}
>>>>>
>>>>>
>>>> On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <artemis@dtechspace.com>
>>>> wrote:
>>>>
>>>>> Could you please share your input file instead of output files on that
>>>>> ticket?   Not sure if you were following the specific file format
>>>>> requirement for JSON in Spark.  The following is a snippet from the Spark
>>>>> online doc:
>>>>>
>>>>> Note that the file that is offered as *a json file* is not a typical
>>>>> JSON file. Each line must contain a separate, self-contained valid JSON
>>>>> object. For more information, please see JSON Lines text format, also
>>>>> called newline-delimited JSON <http://jsonlines.org/>.
>>>>>
>>>>> For a regular multi-line JSON file, set the multiLine option to true.
>>>>>
>>>>> -- ND
>>>>> On 6/29/20 11:55 AM, Sanjeev Mishra wrote:
>>>>>
>>>>> Done. https://issues.apache.org/jira/browse/SPARK-32130
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.gekk@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Sanjeev,
>>>>>>
>>>>>> It is hard to troubleshoot the issue without input files. Could you
>>>>>> open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK
and
>>>>>> attach the JSON files there (or samples or code which generates JSON
>>>>>> files)?
>>>>>>
>>>>>> Maxim Gekk
>>>>>> Software Engineer
>>>>>> Databricks, Inc.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <
>>>>>> sanjeev.mishra@gmail.com> wrote:
>>>>>>
>>>>>>> It has read everything. As you notice the timing of count is
still
>>>>>>> smaller in Spark 2.4
>>>>>>>
>>>>>>> Spark 2.4
>>>>>>>
>>>>>>> scala> spark.time(spark.read.json("/data/20200528"))
>>>>>>> Time taken: 19691 ms
>>>>>>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id:
string
>>>>>>> ... 5 more fields]
>>>>>>>
>>>>>>> scala> spark.time(res61.count())
>>>>>>> Time taken: 7113 ms
>>>>>>> res64: Long = 2605349
>>>>>>>
>>>>>>> Spark 3.0
>>>>>>> scala> spark.time(spark.read.json("/data/20200528"))
>>>>>>> 20/06/29 08:06:53 WARN package: Truncated the string representation
>>>>>>> of a plan since it was too large. This behavior can be adjusted
by setting
>>>>>>> 'spark.sql.debug.maxToStringFields'.
>>>>>>> Time taken: 849652 ms
>>>>>>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id:
string
>>>>>>> ... 5 more fields]
>>>>>>>
>>>>>>> scala> spark.time(res0.count())
>>>>>>> Time taken: 8201 ms
>>>>>>> res2: Long = 2605349
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <artemis@dtechspace.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Could you share your code?  Are you sure you Spark 2.4 cluster
had
>>>>>>>> indeed read anything?  Looks like the Input size field is
empty under 2.4.
>>>>>>>>
>>>>>>>> -- ND
>>>>>>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I have large amount of json files that Spark can read in
36 seconds
>>>>>>>> but Spark 3.0 takes almost 33 minutes to read the same. On
closer analysis,
>>>>>>>> looks like Spark 3.0 is choosing different DAG than Spark
2.0. Does anyone
>>>>>>>> have any idea what is going on? Is there any configuration
problem with
>>>>>>>> Spark 3.0.
>>>>>>>>
>>>>>>>> Here are the details:
>>>>>>>>
>>>>>>>> *Spark 2.4*
>>>>>>>>
>>>>>>>> Summary Metrics for 2203 Completed Tasks
>>>>>>>> <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>
>>>>>>>> Metric Min 25th percentile Median 75th percentile Max
>>>>>>>> Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
>>>>>>>> GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
>>>>>>>> Showing 1 to 2 of 2 entries
>>>>>>>>   Aggregated Metrics by Executor
>>>>>>>> Show  20 40 60 100 All  entries
>>>>>>>> Search:
>>>>>>>> Executor ID Logs Address Task Time Total Tasks Failed Tasks
Killed
>>>>>>>> Tasks Succeeded Tasks Blacklisted
>>>>>>>> driver
>>>>>>>> 10.0.0.8:49159 36 s 2203 0 0 2203 false
>>>>>>>>
>>>>>>>>
>>>>>>>> *Spark 3.0*
>>>>>>>>
>>>>>>>> Summary Metrics for 8 Completed Tasks
>>>>>>>> <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>
>>>>>>>> Metric Min 25th percentile Median 75th percentile Max
>>>>>>>> Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
>>>>>>>> GC Time 3 s 3 s 3 s 4 s 4 s
>>>>>>>> Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8
MiB /
>>>>>>>> 55259 17.8 MiB / 58148 20.2 MiB / 71624
>>>>>>>> Showing 1 to 3 of 3 entries
>>>>>>>>   Aggregated Metrics by Executor
>>>>>>>> Show  20 40 60 100 All  entries
>>>>>>>> Search:
>>>>>>>> Executor ID Logs Address Task Time Total Tasks Failed Tasks
Killed
>>>>>>>> Tasks Succeeded Tasks Blacklisted Input Size / Records
>>>>>>>> driver
>>>>>>>> 10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999
>>>>>>>>
>>>>>>>>
>>>>>>>> The DAG is also different
>>>>>>>> Spark 2.0 DAG
>>>>>>>>
>>>>>>>> <Screenshot 2020-06-27 16.30.26.png>
>>>>>>>>
>>>>>>>> Spark 3.0 DAG
>>>>>>>>
>>>>>>>> <Screenshot 2020-06-27 16.32.32.png>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
>>
>

Mime
View raw message