spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raju Bairishetti <>
Subject Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided
Date Wed, 11 Jan 2017 04:14:28 GMT

   Spark sql is generating query plan with all partitions information even
though if we apply filters on partitions in the query.  Due to this, spark
driver/hive metastore is hitting with OOM as each table is with lots of

We can confirm from hive audit logs that it tries to *fetch all partitions*
from hive metastore.

 2016-12-28 07:18:33,749 INFO  [pool-4-thread-184]: HiveMetaStore.audit
( - ugi=rajub    ip=/x.x.x.x
cmd=get_partitions : db=xxxx tbl=xxxxx

Configured the following parameters in the spark conf to fix the above
issue(source: from spark-jira & github pullreq):

*spark.sql.hive.convertMetastoreParquet   false*
*    spark.sql.hive.metastorePartitionPruning   true*

*   plan:  rdf.explain*
*   == Physical Plan ==*
       HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
tablename, None,   [(year#314 = 2016),(month#315 = 12),(day#316 =
28),(hour#317 = 2),(venture#318 = DEFAULT)]

*    get_partitions_by_filter* method is called and fetching only required

    But we are seeing parquetDecode errors in our applications frequently
after this. Looks like these decoding errors were because of changing
serde from spark-builtin to hive serde.

I feel like,* fixing query plan generation in the spark-sql* is the right
approach instead of forcing users to use hive serde.

Is there any workaround/way to fix this issue? I would like to hear more
thoughts on this :)

Raju Bairishetti,

View raw message