spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hamish Whittal <>
Subject MySQL query continually add IS NOT NULL onto a query even though I don't request it
Date Wed, 01 Apr 2020 05:47:26 GMT
Hi folks,

1) First Problem:
I'm querying MySQL. I submit a query like this:

out ='message_id', 'business_id', 'info',
'entered_system_date', 'auto_update_time').filter("auto_update_time >=
'2020-04-01 05:27'").dropDuplicates(['message_id', 'auto_update_time'])

But what I see in the DB is this:

FROM message WHERE (`auto_update_time` IS NOT NULL)

Of course the IS NOT NULL is causing a scan of the whole table - SLOOOOW

If I do the query straight by loading from MySQL with a query instead of

myQuery = 'select business_id, entered_system_date, message_id, info,
auto_update_time from message where auto_update_time >= "2020-04-01 05:27"'

jdbcUrl = jdbc:mysql://172.xx.xx.xx/prod

wam = spark\
    .option("url", jdbcUrl)\
    .option("query", myQuery)\        <------------------------ difference
    .option("user", connection['scriptParams']['user'])\
    .option("password", connection['scriptParams']['pass'])\

then it's fast as expected.

Question: How come is the query adding on the IS NOT NULL - and this is to
every single query I try to perform if I use dbtable instead of query in
the load/read?

2) Second Problem:
Additionally, if I stop this query (I'm using Zeppelin), then the query on
the DB side continues running for a long while (read 40 - 60s) after the
query stops.

Question: How come is this? I mean, it should stop within a second or 3 -
not 60!

Any advice would be welcome

Version of MySQL driver: 8.0.19.
Db MySQL 5.7.23
Spark version 2.4.4

Could it be a different version of the JDBC driver? I've tried 5.1.48 too.
Same effect.

Thanks in advance,
Cape Town, South Africa
+27 79 614 4913

View raw message