spark-dev mailing list archives

From Chao Sun <>
Subject Re: [SparkSql] Casting of Predicate Literals
Date Wed, 26 Aug 2020 17:06:36 GMT
Thanks Bart. I'll give it a try. Presto has done something very similar on
this (thanks DB for finding this!). They published an article ([1]) last
year with a very thorough analysis of all the cases, which I think can be
used as a reference for the implementation in Spark.


On Wed, Aug 26, 2020 at 1:37 AM Bart Samwel <> wrote:

> IMO it's worth an attempt. The previous attempts seem to have been closed
> because of a general sense that this gets messy and leads to lots of special
> cases, but that's just how it is. This optimization would make the difference
> between getting sub-par performance when using some of these datatypes and
> getting decent performance. Also, even if the predicate doesn't get pushed
> down, the transformation can make execution of the predicate faster. So
> this can be an early optimization rule, not tied to pushdowns specifically.
> I agree that it gets tricky for some data types, so I'd suggest starting
> small and doing this only for integers, then covering decimals. For those
> data types at least you can easily reason that the conversion is correct.
> Other data types are a lot trickier and we should analyze them one by one.
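[The integer-only rewrite Bart suggests can be sketched roughly as follows. This is an illustrative Python sketch with hypothetical names, not Spark's actual Catalyst rule; the key point is that a literal inside the narrower type's range can be compared directly, while an out-of-range literal folds the predicate to a constant.]

```python
# Sketch: rewrite `CAST(col AS wider_int) <op> literal` into a predicate on
# the un-cast, narrower column. Names and structure are illustrative only.

INT_RANGES = {
    "short": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
    "long": (-2**63, 2**63 - 1),
}

def rewrite_int_cast_predicate(col_type, op, literal):
    """Return (op, literal) usable directly on the un-cast column, or a
    constant True/False when the literal is outside the column's range."""
    lo, hi = INT_RANGES[col_type]
    if lo <= literal <= hi:
        return (op, literal)  # safe: the literal fits the narrower type
    # Literal outside the column's range: the predicate is constant.
    if op == "<":
        return True if literal > hi else False   # every value < big literal
    if op == ">":
        return True if literal < lo else False   # every value > small literal
    if op == "=":
        return False  # no narrower value can equal an out-of-range literal
    raise NotImplementedError(op)
```

For example, `rewrite_int_cast_predicate("short", "<", 1000)` keeps the comparison as `short_col < 1000` without the cast, while a literal like 100000, which no short can reach, folds `<` to True and `=` to False.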
> On Tue, Aug 25, 2020 at 7:31 PM Chao Sun <> wrote:
>> Hi,
>> So I just realized there were already multiple attempts on this issue in
>> the past. From the discussion it seems the preferred approach is to
>> eliminate the cast before the predicates get pushed to data sources, at
>> least for a few common cases such as numeric types. However, a few PRs
>> following this direction were rejected (see [1] and [2]), so I'm wondering
>> if this is still something worth trying, or if the community thinks it is
>> risky and better left untouched.
>> On the other hand, perhaps we could at minimum generate some sort of
>> warning to remind users that they need to add an explicit cast to enable
>> pushdown in this case. What do you think?
>> Thanks for your input!
>> Chao
>> [1]:
>> [2]:
>> On Mon, Aug 24, 2020 at 1:57 PM Chao Sun <> wrote:
>>> > Currently we can't. This is something we should improve, by either
>>> > pushing down the cast to the data source, or simplifying the predicates
>>> > to eliminate the cast.
>>> Hi all, I've created to
>>> track this. Feel free to comment on the JIRA.
>>> On Wed, Aug 19, 2020 at 7:08 AM Wenchen Fan <> wrote:
>>>> Currently we can't. This is something we should improve, by either
>>>> pushing down the cast to the data source, or simplifying the predicates to
>>>> eliminate the cast.
>>>> On Wed, Aug 19, 2020 at 5:09 PM Bart Samwel <> wrote:
>>>>> And how are we doing here on integer pushdowns? If someone does e.g.
>>>>> CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000"
>>>>> without the cast?
>>>>> On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer <> wrote:
>>>>>> Thanks! That's exactly what I was hoping for! Thanks for finding the
>>>>>> Jira for me!
>>>>>> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan <> wrote:
>>>>>>> I think this is not a problem in 3.0 anymore, see
>>>>>>> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <> wrote:
>>>>>>>> I've just run into this issue again with another user, and I feel
>>>>>>>> like most folks here have seen some flavor of this at some point.
>>>>>>>> The user registers a Datasource with a column of type Date (or some
>>>>>>>> other non-string type), then performs a query that looks like:
>>>>>>>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>>>>>>> Seeing that the predicate literal here is a String, Spark needs to
>>>>>>>> make both sides of the comparison the same type, so it places a
>>>>>>>> "Cast" on the Datasource column and our plan ends up looking like:
>>>>>>>> Cast(date_col as String) > '2020-08-03'
>>>>>>>> Since the Datasource Strategies can't handle a push down of the
>>>>>>>> "Cast" function, we lose the predicate pushdown we could have had.
>>>>>>>> This can change a Job from a single partition lookup into a full
>>>>>>>> scan, leading to a very confusing situation for the end user. I also
>>>>>>>> wonder about the relative cost here, since we could be avoiding
>>>>>>>> doing X casts and instead just do a single one on the predicate. In
>>>>>>>> addition, we could be doing the cast at the Analysis phase and cut
>>>>>>>> the run short before any work even starts, rather than doing a
>>>>>>>> perhaps meaningless comparison between a date and a non-date string.
>>>>>>>> I think we should seriously consider whether in cases like this we
>>>>>>>> should attempt to cast the literal rather than casting the source
>>>>>>>> column.
>>>>>>>> Please let me know if anyone has thoughts on this, or has any
>>>>>>>> previous Jiras I could dig into if it's been discussed before,
>>>>>>>> Russ
>>>>> --
>>>>> Bart Samwel
> --
> Bart Samwel
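[The literal-side rewrite Russell proposes can be sketched like this. Again a hypothetical Python sketch, not Spark's actual rule: the string literal is cast to the column's type once at planning time, so the comparison that reaches the data source is on the raw date column. When the literal doesn't parse, the sketch keeps the original plan, one of the special cases that made earlier attempts messy.]

```python
from datetime import date

def rewrite_date_predicate(col_name, op, string_literal):
    """Rewrite `Cast(col AS String) <op> 'literal'` into a predicate on the
    raw date column, when the literal parses as an ISO date. Returns None
    (keep the original plan) when it does not. Illustrative names only."""
    try:
        lit = date.fromisoformat(string_literal)
    except ValueError:
        return None  # e.g. '2020-13-99': leave the original comparison alone
    # The single cast now happens once on the literal at planning time,
    # so `col_name <op> lit` is a plain typed predicate that can be pushed
    # down to the data source.
    return (col_name, op, lit)
```

So `rewrite_date_predicate("date_col", ">", "2020-08-03")` yields a pushable `date_col > DATE '2020-08-03'`, while a non-date string falls back to the original plan. Whether falling back (versus failing or folding to a constant) is the right behavior is exactly the kind of case-by-case analysis the thread calls for.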
