spark-user mailing list archives

From Yana Kadiyska <yana.kadiy...@gmail.com>
Subject Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?
Date Tue, 20 Jan 2015 15:52:00 GMT
Hm, you might want to ask on the dev list if you don't get a good answer
here. I'm also trying to decipher this part of the code because I'm having
issues with predicate pushdown. In the master branch, I can see that the
Spark SQL code path (the one taken when the metastore Parquet table is
converted, i.e. ParquetTableScan), in
C:\spark-master\sql\core\src\main\scala\org\apache\spark\sql\parquet\ParquetTableOperations.scala
around line 107, pushes the Parquet filters into a Hadoop configuration
object. Spark 1.2 has similar code in the same file, via the method
ParquetInputFormat.setFilterPredicate. But I think that when you go
through HiveTableScan, you end up in
C:\spark-master\sql\hive\src\main\scala\org\apache\spark\sql\hive\TableReader.scala
and I don't see anything happening with the filters there. I'm not a
dev on this project, though; mostly I'm really interested in the answer.
Please do update if you figure this out!
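
For anyone comparing the two code paths, here is a minimal sketch of the
settings discussed in this thread, written as Spark SQL statements. It
assumes the "test" table from the original message and a HiveContext-backed
session (each statement could also be passed to hiveContext.sql(...) in the
shell). The literal 'a1' in the filter is a made-up value. The sketch only
shows how to inspect the plan; it does not show that the filter is actually
pushed down on the HiveTableScan path, which is the open question here.

```sql
-- Use the Hive SerDe path (HiveTableScan) instead of Spark SQL's
-- built-in Parquet scan (ParquetTableScan).
SET spark.sql.hive.convertMetastoreParquet=false;

-- Ask Spark SQL to push Parquet filters down
-- (per this thread, off by default in Spark 1.2).
SET spark.sql.parquet.filterPushdown=true;

-- Print the physical plan and check which scan operator appears
-- and where the filter on id sits. 'a1' is a hypothetical value.
EXPLAIN SELECT * FROM test WHERE id = 'a1';
```

Setting convertMetastoreParquet back to true and re-running the EXPLAIN
should make it easy to compare the two plans side by side.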

On Mon, Jan 19, 2015 at 8:02 PM, Xiaoyu Wang <wangxy.jd@gmail.com> wrote:

> spark.sql.parquet.filterPushdown=true has been turned on, but with
> spark.sql.hive.convertMetastoreParquet set to false, the first
> parameter has no effect!
>
> 2015-01-20 6:52 GMT+08:00 Yana Kadiyska <yana.kadiyska@gmail.com>:
>
>> If you're talking about filter pushdown for Parquet files, this also has
>> to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true.
>> It's off by default.
>>
>> On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang <wangxy.jd@gmail.com> wrote:
>>
>>> Yes, it works!
>>> But the filter can't be pushed down!
>>>
>>> Is implementing the data source API the only option for a custom
>>> Parquet input format?
>>>
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>
>>> 2015-01-16 21:51 GMT+08:00 Xiaoyu Wang <wangxy.jd@gmail.com>:
>>>
>>>> Thanks yana!
>>>> I will try it!
>>>>
>>>> On Jan 16, 2015, at 20:51, yana <yana.kadiyska@gmail.com> wrote:
>>>>
>>>> I think you might need to set
>>>> spark.sql.hive.convertMetastoreParquet to false if I understand that
>>>> flag correctly
>>>>
>>>>
>>>> -------- Original message --------
>>>> From: Xiaoyu Wang
>>>> Date: 01/16/2015 5:09 AM (GMT-05:00)
>>>> To: user@spark.apache.org
>>>> Subject: Why custom parquet format hive table execute
>>>> "ParquetTableScan" physical plan, not "HiveTableScan"?
>>>>
>>>> Hi all!
>>>>
>>>> In Spark SQL 1.2.0, I created a Hive table with a custom Parquet
>>>> InputFormat and OutputFormat, like this:
>>>> CREATE TABLE test(
>>>>   id string,
>>>>   msg string)
>>>> CLUSTERED BY (
>>>>   id)
>>>> SORTED BY (
>>>>   id ASC)
>>>> INTO 10 BUCKETS
>>>> ROW FORMAT SERDE
>>>>   'com.a.MyParquetHiveSerDe'
>>>> STORED AS INPUTFORMAT
>>>>   'com.a.MyParquetInputFormat'
>>>> OUTPUTFORMAT
>>>>   'com.a.MyParquetOutputFormat';
>>>>
>>>> In the Spark shell, the plan for "select * from test" is:
>>>>
>>>> [== Physical Plan ==]
>>>> [!OutputFaker [id#5,msg#6]]
>>>> [ ParquetTableScan [id#12,msg#13], (ParquetRelation
>>>> hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration:
>>>> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
>>>> yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
>>>> org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]
>>>>
>>>> *Not HiveTableScan*!!!
>>>> *So it doesn't execute my custom InputFormat!*
>>>> Why? How can I make it execute my custom InputFormat?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>
>>
>
