spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiao Li <lix...@databricks.com>
Subject Re: [SparkSql] Casting of Predicate Literals
Date Tue, 04 Aug 2020 16:52:12 GMT
Hi, Russell,

You might hit the other cases in which CAST blocks the predicate pushdown.
If the Cast was added by users and it changes the actual type, we are
unable to optimize it automatically because it could change the query
correctness. If it was added by our type coercion rules
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala>
to
make type consistent at query compile time, we can take a look at the
specific rule. If you think any of them is not reasonable or have different
behaviors from the other database systems, we can discuss it in the PRs or
JIRAs. In general, we have to be very cautious to make any change in these
rules since it could have a big impact and change the query results
silently.

Thanks,

On Tue, Aug 4, 2020 at 9:46 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

> I think this is not a problem in 3.0 anymore, see
> https://issues.apache.org/jira/browse/SPARK-27638
>
> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <russell.spitzer@gmail.com>
> wrote:
>
>> I've just run into this issue again with another user and I feel like
>> most folks here have seen some flavor of this at some point.
>>
>> The user registers a Datasource with a column of type Date (or some non
>> string) then performs a query that looks like.
>>
>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>
>> Seeing that the predicate literal here is a String, Spark needs to make a
>> change so that the DataSource column will be of the same type (Date),
>> so it places a "Cast" on the Datasource column so our plan ends up
>> looking like.
>>
>> Cast(date_col as String) > '2020-08-03'
>>
>> Since the Datasource Strategies can't handle a push down of the "Cast"
>> function we lose the predicate pushdown we could
>> have had. This can change a Job from a single partition lookup into a
>> full scan leading to a very confusing situation for
>> the end user. I also wonder about the relative cost here since we could
>> be avoiding doing X casts and instead just do a single
>> one on the predicate, in addition we could be doing the cast at the
>> Analysis phase and cut the run short before any work even
>> starts rather than doing a perhaps meaningless comparison between a date
>> and a non-date string.
>>
>> I think we should seriously consider whether in cases like this we should
>> attempt to cast the literal rather than casting the
>> source column.
>>
>> Please let me know if anyone has thoughts on this, or has some previous
>> Jiras I could dig into if it's been discussed before,
>> Russ
>>
>

-- 
<https://databricks.com/sparkaisummit/north-america>

Mime
View raw message