spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Preventing predicate pushdown
Date Tue, 15 May 2018 16:29:34 GMT
Applying predicate pushdown is an optimization, and it makes sense to provide
configs to turn off certain optimizations. Feel free to create a JIRA.

Thanks,
Wenchen
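
Archive note: as far as I know, Spark releases from 2.4.0 onward document a JDBC data source option named pushDownPredicate (default true) that serves as such a switch. The sketch below shows how it might be used. The URL, table name, and literal date bounds are hypothetical placeholders, and a reachable database plus its JDBC driver on the classpath are assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DisableJdbcPushdown {
  // Hypothetical connection details, for illustration only.
  val jdbcUrl = "jdbc:db2://db-host:50000/SAMPLE"
  val table = "MYSCHEMA.EVENTS"

  // Builds the reader options. With "pushDownPredicate" set to "false",
  // Spark is asked to evaluate all filters itself instead of sending
  // them to the database.
  def jdbcOptions(pushDown: Boolean): Map[String, String] = Map(
    "url" -> jdbcUrl,
    "dbtable" -> table,
    "pushDownPredicate" -> pushDown.toString
  )

  def load(spark: SparkSession, pushDown: Boolean): DataFrame =
    spark.read.format("jdbc").options(jdbcOptions(pushDown)).load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("no-pushdown").getOrCreate()
    val df = load(spark, pushDown = false)
    // Same shape as the predicate in the quoted message, with
    // hypothetical literal bounds instead of $param1 and $param2.
    df.where("(date1 > '2018-01-01' and (date1 < '2018-05-01' or date1 is null) or x > 100)")
      .explain() // inspect the PushedFilters entry of the JDBC scan node
    spark.stop()
  }
}
```

Because the option is set per reader, pushdown can stay enabled for other sources in the same job.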

On Tue, May 15, 2018 at 8:33 PM, Tomasz Gawęda <tomasz.gaweda@outlook.com>
wrote:

> Hi,
>
> while working with the JDBC data source, I saw that many "or" clauses with
> non-equality operators cause a huge performance degradation of the SQL query
> sent to the database (DB2). For example:
>
> val df = spark.read.format("jdbc") /* other options to parallelize the load */ .load()
>
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
> // in the real application, the pushed-down predicates span many lines, with many ANDs and ORs
>
> If I use cache() before where, there is no predicate pushdown of this
> "where" clause. However, in a production system, caching many sources is a
> waste of memory (especially if the pipeline is long and I must cache many
> times).
>
>
> I asked on StackOverflow for better ideas:
> https://stackoverflow.com/questions/50336355/how-to-prevent-predicate-pushdown
>
> However, there are only workarounds. I can use those workarounds, but
> maybe it would be better to add such functionality directly in the API?
>
> For example: df.withAnalysisBarrier().where(...) ?
>
> Please let me know if I should create a JIRA, or if it's not a good idea for
> some reason.
>
>
> Best regards,
>
> Tomek Gawęda
>
>
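
Archive note: the cache() workaround described in the quoted message can be checked by comparing physical plans. The following is only a sketch. The URL, table name, and literal date bounds are hypothetical, and a reachable database with its JDBC driver on the classpath is assumed.

```scala
import org.apache.spark.sql.SparkSession

object CacheBlocksPushdown {
  // The predicate from the message, with string bounds standing in
  // for $param1 and $param2.
  def predicate(param1: String, param2: String): String =
    s"(date1 > '$param1' and (date1 < '$param2' or date1 is null) or x > 100)"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-vs-pushdown").getOrCreate()

    // Hypothetical connection details, for illustration only.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:db2://db-host:50000/SAMPLE")
      .option("dbtable", "MYSCHEMA.EVENTS")
      .load()

    val cond = predicate("2018-01-01", "2018-05-01")

    // Without caching, the filter is a candidate for pushdown into the
    // JDBC scan; look at the scan node in the printed plan.
    df.where(cond).explain()

    // After cache(), the filter is planned on top of the in-memory
    // relation, so it is not sent to the database with the scan.
    df.cache().where(cond).explain()

    // Release the cached data once it is no longer needed.
    df.unpersist()
    spark.stop()
  }
}
```

The memory cost mentioned in the message still applies: each cached source is materialized on the executors the first time an action runs against it.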
