Spark sql is generating query plan with all partitions information even though if we apply filters on partitions in the query. Due to this, spark driver/hive metastore is hitting with OOM as each table is with lots of partitions.
We can confirm from hive audit logs that it tries to fetch all partitions from hive metastore.
2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x cmd=get_partitions : db=xxxx tbl=xxxxx
Configured the following parameters in the spark conf to fix the above issue(source: from spark-jira & github pullreq):
== Physical Plan ==
HiveTableScan [rejection_reason#626], MetastoreRelation dbname, tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 = 28),(hour#317 = 2),(venture#318 = DEFAULT)]
get_partitions_by_filter method is called and fetching only required partitions.
But we are seeing parquetDecode errors in our applications frequently after this. Looks like these decoding errors were because of changing serde from spark-builtin to hive serde.
I feel like, fixing query plan generation in the spark-sql is the right approach instead of forcing users to use hive serde.
Is there any workaround/way to fix this issue? I would like to hear more thoughts on this :)