spark-user mailing list archives

From Xuelin Cao <xuelincao2...@gmail.com>
Subject Re: Why Parquet Predicate Pushdown doesn't work?
Date Thu, 08 Jan 2015 06:40:49 GMT
Yes, and that is the problem: I have already turned the flag on.

One possible reason is that Parquet supports predicate pushdown by storing
min/max statistics for each column in every Parquet block (row group). If,
in my test, the rows with groupId=10113000 are scattered across all the
blocks, then no block can be skipped and the pushdown has no visible effect.

But I'm not sure about that, and I don't know whether some other cause
could lead to this.
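
To illustrate the guess above, here is a toy model in plain Python. It is
not Spark or Parquet library code. The RowGroup class, the block size of
1000 rows, and the eight made-up group IDs are all assumptions for the
sketch. It only shows why min/max statistics skip nothing when the target
value appears in every block, and skip almost everything when the column is
sorted before writing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Simplified stand-in for one Parquet block of a single integer column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def split_into_row_groups(values: List[int], rows_per_group: int) -> List[RowGroup]:
    """Cut the column into consecutive blocks of rows_per_group rows."""
    return [
        RowGroup(values[i:i + rows_per_group])
        for i in range(0, len(values), rows_per_group)
    ]


def groups_to_read(groups: List[RowGroup], target: int) -> int:
    """Count blocks whose [min, max] range could contain target.

    A block outside that range can be skipped without reading its rows.
    """
    return sum(1 for g in groups if g.min_value <= target <= g.max_value)


# Hypothetical data: 8 distinct group IDs, 1000 rows each.
group_ids = [10113000 + k for k in range(8)]
scattered = [group_ids[i % 8] for i in range(8000)]  # IDs interleaved
clustered = sorted(scattered)                        # IDs sorted before writing

target = 10113000
scattered_hits = groups_to_read(split_into_row_groups(scattered, 1000), target)
clustered_hits = groups_to_read(split_into_row_groups(clustered, 1000), target)

print(scattered_hits, clustered_hits)  # prints: 8 1
```

In this model the scattered layout forces all 8 blocks to be read, while the
sorted layout needs only 1. If the real file behaves the same way, sorting by
groupId before writing it might be what makes the pushdown visible.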


On Wed, Jan 7, 2015 at 10:14 PM, Cody Koeninger <cody@koeninger.org> wrote:

> But Xuelin already posted in the original message that the code was using
>
> SET spark.sql.parquet.filterPushdown=true
>
> On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv <danielrulez@gmail.com>
> wrote:
>
>> Quoting Michael:
>> Predicate push down into the input format is turned off by default
>> because there is a bug in the current parquet library that causes a null
>> pointer exception when entire row groups are null.
>>
>> https://issues.apache.org/jira/browse/SPARK-4258
>>
>> You can turn it on if you want:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
>>
>> Daniel
>>
>> On 7 Jan 2015, at 08:18, Xuelin Cao <xuelincao@yahoo.com.INVALID>
>> wrote:
>>
>>
>> Hi,
>>
>>        I'm testing the Parquet file format, and predicate pushdown is a
>> very useful feature for us.
>>
>>        However, it looks like predicate pushdown doesn't work after
>> I set
>>        sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
>>
>>        Here is my SQL:
>>        sqlContext.sql("select adId, adTitle from ad where groupId=10113000").collect
>>
>>        Then I checked the amount of input data in the web UI. It is
>> ALWAYS 80.2M, regardless of whether I turn the
>> spark.sql.parquet.filterPushdown flag on or off.
>>
>>        I'm not sure whether there is anything I must do when generating
>> the Parquet file to make predicate pushdown available. (With ORC, for
>> example, I need to explicitly sort the field that will be used for
>> predicate pushdown when creating the file.)
>>
>>        Does anyone have any idea?
>>
>>        And does anyone know the internal mechanism for Parquet predicate
>> pushdown?
>>
>>        Thanks
>>
>>
>>
>>
>
