Hi folks,

1) First Problem:
I'm querying MySQL. I submit a query like this:

out = wam.select('message_id', 'business_id', 'info', 'entered_system_date', 'auto_update_time').filter("auto_update_time >= '2020-04-01 05:27'").dropDuplicates(['message_id', 'auto_update_time'])

But what I see in the DB is this:

SELECT `message_id`,`business_id`,`info`,`entered_system_date`,`auto_update_time` FROM message WHERE (`auto_update_time` IS NOT NULL)

Of course the IS NOT NULL is causing a scan of the whole table - SLOOOOW

If I do the query straight by loading from MySQL with a query instead of dbtable:

myQuery = 'select business_id, entered_system_date, message_id, info, auto_update_time from message where auto_update_time >= "2020-04-01 05:27"'

jdbcUrl = jdbc:mysql://172.xx.xx.xx/prod

wam = spark\
    .read\
    .format("jdbc")\
    .option("url", jdbcUrl)\
    .option("query", myQuery)\        <------------------------ difference here
    .option("user", connection['scriptParams']['user'])\
    .option("password", connection['scriptParams']['pass'])\
    .option("numPartitions",10)\
    .load()

then it's fast as expected.

Question: How come is the query adding on the IS NOT NULL - and this is to every single query I try to perform if I use dbtable instead of query in the load/read?

2) Second Problem:
Additionally, if I stop this query (I'm using Zeppelin), then the query on the DB side continues running for a long while (read 40 - 60s) after the query stops.

Question: How come is this? I mean, it should stop within a second or 3 - not 60!

Any advice would be welcome

Version of MySQL driver: 8.0.19.
Db MySQL 5.7.23
Spark version 2.4.4

Could it be a different version of the JDBC driver? I've tried 5.1.48 too. Same effect.

Thanks in advance,
Hamish
--
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913