spark-issues mailing list archives

From "Dongjoon Hyun (Jira)" <>
Subject [jira] [Updated] (SPARK-24914) totalSize is not a good estimate for broadcast joins
Date Mon, 16 Mar 2020 22:52:07 GMT


Dongjoon Hyun updated SPARK-24914:
    Affects Version/s:     (was: 3.0.0)

> totalSize is not a good estimate for broadcast joins
> ----------------------------------------------------
>                 Key: SPARK-24914
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Bruce Robbins
>            Priority: Major
> When determining whether to do a broadcast join, Spark estimates the size of the smaller
table as follows:
>  - if totalSize is defined and greater than 0, use it.
>  - else, if rawDataSize is defined and greater than 0, use it.
>  - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue).
> Therefore, Spark prefers totalSize over rawDataSize.
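The fallback order above can be sketched as follows. This is a minimal standalone sketch; the object and method names are illustrative and not Spark's actual internals.

```scala
// Sketch of the size-estimation fallback order described above.
// `SizeEstimate` and `estimate` are illustrative names, not Spark's API.
object SizeEstimate {
  // Default of spark.sql.defaultSizeInBytes, per the issue text.
  val defaultSizeInBytes: Long = Long.MaxValue

  def estimate(totalSize: Option[Long], rawDataSize: Option[Long]): Long =
    totalSize.filter(_ > 0)               // prefer totalSize when defined and > 0
      .orElse(rawDataSize.filter(_ > 0))  // else fall back to rawDataSize
      .getOrElse(defaultSizeInBytes)      // else the (huge) default
}
```

Because `Long.MaxValue` is the final fallback, a table with no statistics at all will never be picked for a broadcast join; the problem described here arises only when a small-looking totalSize wins the first branch.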
> Unfortunately, totalSize is often quite a bit smaller than the actual table size, since
it represents the size of the table's files on disk. Parquet and ORC files, for example, are
encoded and compressed. This can result in the JVM throwing an OutOfMemoryError while Spark
is loading the table into a HashedRelation, or when Spark actually attempts to broadcast the
table.
> On the other hand, rawDataSize represents the uncompressed size of the dataset, according
to Hive documentation. This seems like a pretty good number to use in preference to totalSize.
However, due to HIVE-20079, this value is simply #columns * #rows. Once that bug is fixed,
it may be a superior statistic, at least for managed tables.
> In the meantime, we could apply a configurable "fudge factor" to totalSize, at least
for types of files that are encoded and compressed. Hive has the setting hive.stats.deserialization.factor,
which defaults to 1.0, and is described as follows:
> {quote}in the absence of uncompressed/raw data size, total file size will be used for
statistics annotation. But the file may be compressed, encoded and serialized which may be
lesser in size than the actual uncompressed/raw data size. This factor will be multiplied
to file size to estimate the raw data size.
> {quote}
> Also, I propose a configuration setting to allow the user to completely ignore rawDataSize,
since that value is broken (due to HIVE-20079). When that configuration setting is set to
true, Spark would instead estimate the table as follows:
>  - if totalSize is defined and greater than 0, use totalSize * fudgeFactor.
>  - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue).
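Under the proposal, the estimate for a Hive table would reduce to the sketch below, where `fudgeFactor` plays the role of a setting like Hive's hive.stats.deserialization.factor. The names here are hypothetical, not an actual Spark configuration.

```scala
// Sketch of the proposed estimate when rawDataSize is ignored entirely.
// `ProposedEstimate` and `fudgeFactor` are hypothetical names.
object ProposedEstimate {
  def estimate(totalSize: Option[Long], fudgeFactor: Double): Long =
    totalSize.filter(_ > 0)
      .map(s => (s * fudgeFactor).toLong)  // scale on-disk size up toward raw size
      .getOrElse(Long.MaxValue)            // spark.sql.defaultSizeInBytes default
}
```

For example, a 10 MB compressed Parquet table with a factor of 3.0 would be estimated at 30 MB, making it less likely to slip under the broadcast threshold when its uncompressed form is much larger.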
> Caveat: This mitigates the issue only for Hive tables. It does not help much when the
user is reading files using {{}}, unless we apply the same fudge factor.

This message was sent by Atlassian Jira
