spark-user mailing list archives

From Corey Nolet <cjno...@gmail.com>
Subject Re: Spark SQL Custom Predicate Pushdown
Date Sat, 17 Jan 2015 21:40:29 GMT
I see now. It optimizes the selection semantics so that fewer things need to
be included just to do a count(). Very nice. I did a collect() instead of a
count() just to see what would happen, and it looks like all the expected
select fields were propagated down. Thanks.
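
For anyone reading this thread later, here is a minimal sketch of the behavior
described above. It is a plain-Python model, not Spark code: ToyRelation,
EqualTo, required_columns, collect and count are hypothetical names made up
for illustration, and the pruning rule (projected columns plus columns
referenced by filters) is an assumption based on what was observed in this
thread for Spark 1.2.0.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass(frozen=True)
class EqualTo:
    """Hypothetical stand-in for an equality filter handed to a data source."""
    attribute: str
    value: Any


class ToyRelation:
    """Hypothetical data source that records which columns it was asked for."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self.rows = rows
        self.last_required_columns: List[str] = []

    def build_scan(self, required_columns: Sequence[str],
                   filters: Sequence[EqualTo]) -> List[Dict[str, Any]]:
        # Remember the request so the caller can inspect what was pushed down.
        self.last_required_columns = list(required_columns)
        matching = (r for r in self.rows
                    if all(r[f.attribute] == f.value for f in filters))
        # Return only the requested columns for each matching row.
        return [{c: r[c] for c in required_columns} for r in matching]


def required_columns(schema: Sequence[str], projected: Sequence[str],
                     filters: Sequence[EqualTo]) -> List[str]:
    """Assumed rule: projected columns plus columns referenced by filters."""
    needed = set(projected) | {f.attribute for f in filters}
    return [c for c in schema if c in needed]


def collect(relation: ToyRelation, schema: Sequence[str],
            select: Sequence[str], filters: Sequence[EqualTo] = ()):
    # collect() needs the selected values, so they are all requested.
    cols = required_columns(schema, select, filters)
    rows = relation.build_scan(cols, filters)
    return [{c: r[c] for c in select} for r in rows]


def count(relation: ToyRelation, schema: Sequence[str],
          filters: Sequence[EqualTo] = ()) -> int:
    # count() needs no column values, so only filter columns are requested.
    cols = required_columns(schema, [], filters)
    return len(relation.build_scan(cols, filters))
```

In this model, count() with a filter on key1 asks the relation for key1 only,
count() with no filter asks for no columns at all, and collect() on
SELECT key1,key2 asks for both columns, which matches what I saw in my tests.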





On Sat, Jan 17, 2015 at 4:29 PM, Michael Armbrust <michael@databricks.com>
wrote:

> How are you running your test here?  Are you perhaps doing a .count()?
>
> On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet <cjnolet@gmail.com> wrote:
>
>> Michael,
>>
>> What I'm seeing (in Spark 1.2.0) is that the required columns being
>> pushed down to the DataRelation are not the product of the SELECT clause
>> but rather just the columns explicitly included in the WHERE clause.
>>
>> Examples from my testing:
>>
>> SELECT * FROM myTable --> The required columns are empty.
>> SELECT key1 FROM myTable --> The required columns are empty.
>> SELECT * FROM myTable WHERE key1 = 'val1' --> The required columns
>> contain key1.
>> SELECT key1,key2 FROM myTable WHERE key1 = 'val1' --> The required
>> columns contain key1.
>> SELECT key1,key2 FROM myTable WHERE key1 = 'val1' AND key2 = 'val2' -->
>> The required columns contain key1,key2.
>>
>>
>>
>> I created SPARK-5296 for the OR predicate to be pushed down in some
>> capacity.
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Jan 17, 2015 at 3:38 PM, Michael Armbrust <michael@databricks.com>
>> wrote:
>>
>>>> 1) The fields in the SELECT clause are not pushed down to the predicate
>>>> pushdown API. I have many optimizations that allow fields to be filtered
>>>> out before the resulting object is serialized on the Accumulo tablet
>>>> server. How can I get the selection information from the execution plan?
>>>> I'm a little hesitant to implement the data relation that allows me to see
>>>> the logical plan because it's noted in the comments that it could change
>>>> without warning.
>>>>
>>>
>>> I'm not sure I understand.  The list of required columns should be
>>> pushed down to the data source.  Are you looking for something more
>>> complicated?
>>>
>>>
>>>> 2) I'm surprised to find that the predicate pushdown filters get
>>>> completely removed when I do anything more complex in a where clause other
>>>> than simple AND statements. Using an OR statement caused the filter array
>>>> that was passed into the PrunedFilteredDataSource to be empty.
>>>>
>>>
>>> This was just an initial cut at the set of predicates to push down.  We
>>> can add Or.  Mind opening a JIRA?
>>>
>>
>>
>
