spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dongjoon Hyun (Jira)" <>
Subject [jira] [Updated] (SPARK-25643) Performance issues querying wide rows
Date Mon, 16 Mar 2020 22:54:06 GMT


Dongjoon Hyun updated SPARK-25643:
    Affects Version/s:     (was: 3.0.0)

> Performance issues querying wide rows
> -------------------------------------
>                 Key: SPARK-25643
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Bruce Robbins
>            Priority: Major
> Querying a small subset of rows from a wide table (e.g., a table with 6000 columns) can
be quite slow in the following case:
>  * the table has many rows (most of which will be filtered out)
>  * the projection includes every column of a wide table (i.e., select *)
>  * predicate push down is not helping: either matching rows are sprinkled fairly evenly
throughout the table, or predicate push down is switched off
> Even if the filter involves only a single column and the returned result includes just
a few rows, the query can run much longer compared to an equivalent query against a similar
table with fewer columns.
> According to initial profiling, it appears that most time is spent realizing the entire
row in the scan, just so the filter can look at a tiny subset of columns and almost certainly
throw the row away. The profiling shows 74% of time is spent in FileSourceScanExec, and that
time is spent across numerous writeFields_0_xxx method calls.
> If Spark must realize the entire row just to check a tiny subset of columns, this all
sounds reasonable. However, I wonder if there is an optimization here where we can avoid realizing
the entire row until after the filter has selected the row.

This message was sent by Atlassian Jira

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message