spark-issues mailing list archives

From "liupengcheng (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-30394) Skip collecting stats in DetermineTableStats rule when hive table is convertible to datasource tables
Date Tue, 31 Dec 2019 03:54:00 GMT
liupengcheng created SPARK-30394:
------------------------------------

             Summary: Skip collecting stats in DetermineTableStats rule when hive table is
convertible to datasource tables
                 Key: SPARK-30394
                 URL: https://issues.apache.org/jira/browse/SPARK-30394
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2, 3.0.0
            Reporter: liupengcheng


Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark scans the table's HDFS
files to collect table stats in the `DetermineTableStats` rule. This scan can be expensive in
some cases, and we can skip it when the Hive table can be converted to a datasource table
(Parquet, etc.).
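
The proposed check can be sketched as a plain Python model; this is not Spark's Scala rule. The
`HiveTable` and `Conf` classes and both function names are hypothetical. The only names taken
from Spark are the settings the fields mirror (`spark.sql.statistics.fallBackToHdfs`,
`spark.sql.hive.convertMetastoreParquet`, `spark.sql.hive.convertMetastoreOrc`), and the
defaults below are chosen for illustration. The sketch also assumes the scan is only considered
when the metastore holds no stats for the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HiveTable:
    """Hypothetical stand-in for a Hive table's catalog entry."""
    name: str
    serde: str                          # e.g. "parquet", "orc", "textfile"
    stats_size_in_bytes: Optional[int]  # None when the metastore has no stats


@dataclass(frozen=True)
class Conf:
    """Hypothetical stand-in for the relevant session settings."""
    fall_back_to_hdfs: bool = False         # spark.sql.statistics.fallBackToHdfs
    convert_metastore_parquet: bool = True  # spark.sql.hive.convertMetastoreParquet
    convert_metastore_orc: bool = True      # spark.sql.hive.convertMetastoreOrc


def is_convertible(table: HiveTable, conf: Conf) -> bool:
    """True if the Hive table would be read as a datasource table instead."""
    serde = table.serde.lower()
    return (serde == "parquet" and conf.convert_metastore_parquet) or (
        serde == "orc" and conf.convert_metastore_orc
    )


def should_scan_hdfs_for_stats(table: HiveTable, conf: Conf) -> bool:
    """Decide whether the stats rule should pay for an HDFS file scan."""
    if table.stats_size_in_bytes is not None:
        return False  # stats already present, nothing to compute
    if not conf.fall_back_to_hdfs:
        return False  # fallback disabled
    # Proposed improvement: skip the scan for convertible tables.
    return not is_convertible(table, conf)
```

Under this model, a text-format table without stats is still scanned when the fallback is
enabled, while a Parquet table is skipped unless conversion is turned off.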

Before [SPARK-28573|https://issues.apache.org/jira/browse/SPARK-28573], the implementation
updated the CatalogTableStatistics, which caused improper stats to be used in joinSelection
when the Hive table could be converted to a datasource table.

In our production environment, a user's highly compressed Parquet table can cause OOMs when
doing `broadcastHashJoin` because of these improper stats.
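
To illustrate why a size taken from files on disk can mislead the join planner, here is a small
Python model of a threshold check. It is a sketch under stated assumptions and not Spark's code:
`can_broadcast` is a hypothetical helper, the 8 MB file size and the 20x expansion factor are
made-up numbers, and only the 10 MB value (the documented default of
`spark.sql.autoBroadcastJoinThreshold`) comes from Spark.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024


def can_broadcast(size_in_bytes: int,
                  threshold: int = AUTO_BROADCAST_JOIN_THRESHOLD) -> bool:
    """Hypothetical helper: a side is broadcastable if its estimated size fits the threshold."""
    return 0 <= size_in_bytes <= threshold


# Assumed numbers, for illustration only.
on_disk_bytes = 8 * 1024 * 1024    # compressed Parquet files as seen on HDFS
assumed_expansion_factor = 20      # hypothetical decompression blow-up
in_memory_bytes = on_disk_bytes * assumed_expansion_factor

print(can_broadcast(on_disk_bytes))    # True: the file size fits under 10 MB
print(can_broadcast(in_memory_bytes))  # False: 160 MB would not be broadcast
```

If the planner sees only the on-disk figure, the table passes the check even though the data
it would have to hold in memory is far larger, which is consistent with the OOMs described
above.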



