Hm, you might want to ask on the dev list if you don't get a good answer here. I'm also trying to decipher this part of the code, as I'm having issues with predicate pushdown. I can see (in the master branch) that the SQL codepath (which is taken unless you set spark.sql.hive.convertMetastoreParquet to false) pushes the parquet filters into a Hadoop Configuration object, in C:\spark-master\sql\core\src\main\scala\org\apache\spark\sql\parquet\ParquetTableOperations.scala around line 107. Spark 1.2 has similar code in the same file, via the ParquetInputFormat.setFilterPredicate method. But I think in the case where you go through HiveTableScan you'd go through C:\spark-master\sql\hive\src\main\scala\org\apache\spark\sql\hive\TableReader.scala, and I don't see anything happening with the filters there. But I'm not a dev on this project -- mostly I'm really interested in the answer. Please do update if you figure this out!
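
For reference, here's roughly what that codepath does, sketched against the pre-Apache parquet-mr API that Spark 1.2 shipped with (the column name and value here are made up for illustration):

import org.apache.hadoop.conf.Configuration
import parquet.filter2.predicate.FilterApi
import parquet.hadoop.ParquetInputFormat
import parquet.io.api.Binary

val conf = new Configuration()
// Build a predicate equivalent to WHERE id = '42' (column and value made up).
val predicate = FilterApi.eq(FilterApi.binaryColumn("id"), Binary.fromString("42"))
// Serialize the predicate into the job configuration so the parquet reader
// can skip row groups / records at scan time.
ParquetInputFormat.setFilterPredicate(conf, predicate)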

On Mon, Jan 19, 2015 at 8:02 PM, Xiaoyu Wang <wangxy.jd@gmail.com> wrote:
spark.sql.parquet.filterPushdown=true has been turned on. But with spark.sql.hive.convertMetastoreParquet set to false, the first parameter no longer takes effect!!!

2015-01-20 6:52 GMT+08:00 Yana Kadiyska <yana.kadiyska@gmail.com>:
If you're talking about filter pushdown for parquet files, this also has to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's off by default.
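
You can set it per session, e.g. in the spark-shell (assuming your SQLContext or HiveContext is called sqlContext):

// Parquet filter pushdown is off by default in 1.2; turn it on for this session.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")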

On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang <wangxy.jd@gmail.com> wrote:
Yes, it works!
But the filter can't be pushed down!!!

Can a custom parquet inputformat only be supported by implementing the data source API? Something like the sketch below?
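
Roughly like this hypothetical skeleton of a Spark 1.2 data source (all the My* names are placeholders), where the pushable filters get handed to buildScan:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._

class DefaultSource extends RelationProvider {
  // Spark instantiates this when a table is created USING the package name.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    MyParquetRelation(parameters("path"))(sqlContext)
}

case class MyParquetRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  // Schema of the example table in this thread.
  override def schema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("msg", StringType, nullable = true)))

  // Spark hands over the required columns and whatever filters it can push;
  // this is where a custom reader would apply them.
  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] = {
    // ... configure the custom InputFormat with `filters` and read `path` ...
    sqlContext.sparkContext.emptyRDD[Row]
  }
}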


2015-01-16 21:51 GMT+08:00 Xiaoyu Wang <wangxy.jd@gmail.com>:
Thanks yana!
I will try it!

On Jan 16, 2015, at 20:51, yana <yana.kadiyska@gmail.com> wrote:

I think you might need to set spark.sql.hive.convertMetastoreParquet to false, if I understand that flag correctly.
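
For example, through the normal sql() entry point (assuming a HiveContext named sqlContext):

// Stop Spark from swapping in its own ParquetTableScan for metastore parquet tables.
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")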

-------- Original message --------
From: Xiaoyu Wang
Date: 01/16/2015 5:09 AM (GMT-05:00)
Subject: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

Hi all!

In Spark SQL 1.2.0, I created a Hive table with a custom parquet inputformat and outputformat, like this:
CREATE TABLE test(
  id string, 
  msg string)
CLUSTERED BY ( 
  id) 
SORTED BY ( 
  id ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE
  'com.a.MyParquetHiveSerDe'
STORED AS INPUTFORMAT 
  'com.a.MyParquetInputFormat'
OUTPUTFORMAT 
  'com.a.MyParquetOutputFormat';

And in the spark shell, the plan of "select * from test" is:

[== Physical Plan ==]
[!OutputFaker [id#5,msg#6]]
[ ParquetTableScan [id#12,msg#13], (ParquetRelation hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]

Not HiveTableScan!!!
So it doesn't execute my custom inputformat!
Why? How can I make it execute my custom inputformat?

Thanks!