spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bart Samwel <bart.sam...@databricks.com.INVALID>
Subject Re: [SparkSql] Casting of Predicate Literals
Date Wed, 19 Aug 2020 09:08:56 GMT
And how are we doing here on integer pushdowns? If someone does e.g.
CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000"
without the cast?

On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer <russell.spitzer@gmail.com>
wrote:

> Thanks! That's exactly what I was hoping for! Thanks for finding the Jira
> for me!
>
> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> I think this is not a problem in 3.0 anymore, see
>> https://issues.apache.org/jira/browse/SPARK-27638
>>
>> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <
>> russell.spitzer@gmail.com> wrote:
>>
>>> I've just run into this issue again with another user and I feel like
>>> most folks here have seen some flavor of this at some point.
>>>
>>> The user registers a Datasource with a column of type Date (or some non
>>> string) then performs a query that looks like.
>>>
>>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>>
>>> Seeing that the predicate literal here is a String, Spark needs to make
>>> a change so that the DataSource column will be of the same type (Date),
>>> so it places a "Cast" on the Datasource column so our plan ends up
>>> looking like.
>>>
>>> Cast(date_col as String) > '2020-08-03'
>>>
>>> Since the Datasource Strategies can't handle a push down of the "Cast"
>>> function we lose the predicate pushdown we could
>>> have had. This can change a Job from a single partition lookup into a
>>> full scan leading to a very confusing situation for
>>> the end user. I also wonder about the relative cost here since we could
>>> be avoiding doing X casts and instead just do a single
>>> one on the predicate, in addition we could be doing the cast at the
>>> Analysis phase and cut the run short before any work even
>>> starts rather than doing a perhaps meaningless comparison between a date
>>> and a non-date string.
>>>
>>> I think we should seriously consider whether in cases like this we
>>> should attempt to cast the literal rather than casting the
>>> source column.
>>>
>>> Please let me know if anyone has thoughts on this, or has some previous
>>> Jiras I could dig into if it's been discussed before,
>>> Russ
>>>
>>

-- 
Bart Samwel
bart.samwel@databricks.com

Mime
View raw message