spark.sql.parquet.filterPushdown defaults to false because of a bug in Parquet that may cause an NPE. Please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

This bug hasn't been fixed in Parquet master yet. We'll turn this on once it is fixed.

Cheng

On 1/19/15 5:02 PM, Xiaoyu Wang wrote:

spark.sql.parquet.filterPushdown=true has been turned on. But with spark.sql.hive.convertMetastoreParquet set to false, the first parameter no longer takes effect!!!

2015-01-20 6:52 GMT+08:00 Yana Kadiyska <yana.kadiyska@gmail.com>:
If you're talking about filter pushdown for Parquet files, this also has to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's off by default.
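For reference, a minimal way to flip this flag for the current session from the Spark shell (assuming Spark SQL 1.2.x, where it is off by default) is:

```sql
-- Enable Parquet filter pushdown for this session
-- (off by default in Spark SQL 1.2.x)
SET spark.sql.parquet.filterPushdown=true;
```

It can also be set once for all sessions via spark-defaults.conf or a --conf flag when launching the shell.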

On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang <wangxy.jd@gmail.com> wrote:
Yes it works!
But the filter can't be pushed down!!!

Does a custom ParquetInputFormat have to implement the data source API?


2015-01-16 21:51 GMT+08:00 Xiaoyu Wang <wangxy.jd@gmail.com>:
Thanks yana!
I will try it!

On Jan 16, 2015, at 20:51, yana <yana.kadiyska@gmail.com> wrote:

I think you might need to set spark.sql.hive.convertMetastoreParquet to false, if I understand that flag correctly.
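As a sketch, this flag can also be set per session in the Spark shell (flag name as of Spark SQL 1.2; behavior may differ in later versions):

```sql
-- Fall back to Hive's own SerDe/InputFormat for Parquet metastore tables
-- instead of Spark SQL's native ParquetTableScan
SET spark.sql.hive.convertMetastoreParquet=false;
```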



-------- Original message --------
From: Xiaoyu Wang
Date:01/16/2015 5:09 AM (GMT-05:00)
Subject: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

Hi all!

In Spark SQL 1.2.0, I created a Hive table with a custom Parquet InputFormat and OutputFormat, like this:
CREATE TABLE test(
  id string,
  msg string)
CLUSTERED BY (
  id)
SORTED BY (
  id ASC)
INTO 10 BUCKETS
ROW FORMAT SERDE
  'com.a.MyParquetHiveSerDe'
STORED AS INPUTFORMAT
  'com.a.MyParquetInputFormat'
OUTPUTFORMAT
  'com.a.MyParquetOutputFormat';

And in the Spark shell, the plan of "select * from test" is:

[== Physical Plan ==]
[!OutputFaker [id#5,msg#6]]
[ ParquetTableScan [id#12,msg#13], (ParquetRelation hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]

Not HiveTableScan!!!
So it doesn't execute my custom InputFormat!
Why? How can I make it execute my custom InputFormat?
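(For anyone reproducing this: the physical plan above can be inspected directly in the shell, e.g.:)

```sql
-- Show the logical and physical plans; a native scan appears as
-- ParquetTableScan, a Hive SerDe/InputFormat scan as HiveTableScan
EXPLAIN EXTENDED SELECT * FROM test;
```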

Thanks!