spark-issues mailing list archives

From "Ruslan Dautkhanov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
Date Sat, 21 Oct 2017 17:49:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruslan Dautkhanov updated SPARK-21657:
--------------------------------------
    Affects Version/s: 2.3.0
           Issue Type: Bug  (was: Improvement)

> Spark has exponential time complexity to explode(array of structs)
> ------------------------------------------------------------------
>
>                 Key: SPARK-21657
>                 URL: https://issues.apache.org/jira/browse/SPARK-21657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>            Reporter: Ruslan Dautkhanov
>              Labels: cache, caching, collections, nested_types, performance, pyspark, sparksql, sql
>         Attachments: ExponentialTimeGrowth.PNG, nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection (0.5m) on recent Xeon processors.
> See attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
> # count() is an action on the DataFrame, so it forces the explode to run
> print(cached_df.count())
> {code}
> This script generates a number of tables, each with the same total number of nested records across all nested collections (see the `scaling` variable in the loops). The `scaling` variable scales up the number of nested elements in each record and scales down the number of records in the table by the same factor, so the total number of nested records stays the same.
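
A minimal sketch of the scaling scheme described above, in plain Python with no Spark dependency. It is not the attached script: the function name `scaling_plan`, the base record count, and the list of scaling factors are all hypothetical values chosen for illustration.

```python
def scaling_plan(base_records, scalings):
    """Return one table layout per scaling factor.

    Each layout multiplies the nested elements per record by `scaling`
    and divides the number of records by the same factor, so the total
    number of nested elements is constant across layouts.
    """
    plan = []
    for scaling in scalings:
        if base_records % scaling != 0:
            raise ValueError("scaling must divide base_records evenly")
        records = base_records // scaling
        plan.append({
            "scaling": scaling,
            "records": records,
            "elements_per_record": scaling,
            "total_elements": records * scaling,
        })
    return plan


# Hypothetical inputs, not taken from the attached script.
for layout in scaling_plan(1_000_000, [1, 10, 100, 1000]):
    print(layout)
```

Every layout carries the same `total_elements`, so any difference in explode time between tables comes from the shape of the data rather than its volume.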
> Time grows exponentially (notice log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to explode the nested collections (!) of 8k records.
> Beyond 1000 elements per nested collection, time grows exponentially.
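
The timings quoted above could be collected with a small helper along these lines. This is a sketch under assumptions, not code from the attached script: `time_explode` is a hypothetical name, and `sqlc` is assumed to be any object exposing `sql()` the way the snippet in the description uses it (for example a SQLContext). The query text is the one from the description.

```python
import time


def time_explode(sqlc, table_name):
    """Return (row_count, elapsed_seconds) for exploding `amft` in one table."""
    query = 'select individ, hholdid, explode(amft) from ' + table_name
    start = time.perf_counter()
    cached_df = sqlc.sql(query).cache()
    # cache() is lazy; count() is the action that actually runs the explode.
    row_count = cached_df.count()
    elapsed = time.perf_counter() - start
    return row_count, elapsed
```

Calling it once per generated table and recording the elapsed time against the scaling factor gives the data points for a plot like the attached one.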



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
