spark-user mailing list archives

From Hamish Whittal <ham...@cloud-fundis.co.za>
Subject Re: MySQL query continually adds IS NOT NULL onto a query even though I don't request it
Date Thu, 02 Apr 2020 08:12:11 GMT
I'm posting this here so that it might be of help to anyone else who runs
into similar problems.

First problem:
I tried multiple different MySQL JDBC drivers to no avail, then I put Spark
into uber-chatty mode (DEBUG), looked at what the Catalyst compiler was
producing, and started seeing datetimes being compared to NULL. Funny, that.
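
If you want to see the same thing, here is roughly what I did (a sketch;
wam is the DataFrame from the quoted message below):

    # very chatty, but it shows the SQL Spark actually sends over JDBC
    spark.sparkContext.setLogLevel("DEBUG")

    # or just look at the plan Catalyst produces for the filter
    wam.filter("auto_update_time >= '2020-04-01 05:27'").explain(True)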

So I modified my query from:
    filter("auto_update_time >= '2020-04-01 05:27'")

to
    filter("auto_update_time >= to_timestamp('2020-04-01 05:27''))

and the IS NOT NULL is gone, and the query returns super-fast as expected.
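
So the original select from the message quoted below, with only the filter
changed, looks something like this:

    out = wam.select('message_id', 'business_id', 'info',
                     'entered_system_date', 'auto_update_time') \
             .filter("auto_update_time >= to_timestamp('2020-04-01 05:27')") \
             .dropDuplicates(['message_id', 'auto_update_time'])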

Second problem:
The query ran for a long time. This was related to the first problem, and of
course spark.stop() cancelled the query - but it also killed the Spark session
that Zeppelin had started for me. Not ideal. I ran the same query from the
spark-submit command line and it also hung for a long while, but because I
ended up killing the spark-submit, the query stopped running on the MySQL
server too. That is what hinted at the spark.stop() above.
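
If you hit the same thing in Zeppelin, a gentler option than spark.stop() -
one I have not tried in this exact setup - is to cancel just the running jobs
rather than tearing down the whole session:

    # cancels the jobs currently running on this SparkContext,
    # without stopping the session that Zeppelin owns
    spark.sparkContext.cancelAllJobs()

    # or, if the work was started under a job group ("mysql-pull" is just an
    # illustrative name):
    # spark.sparkContext.setJobGroup("mysql-pull", "message table pull")
    # spark.sparkContext.cancelJobGroup("mysql-pull")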

Once I had fixed the query, the job completed in seconds and then the query
on the MySQL side completed too.

All's well that ends well. Time for a Gin and Tonic. Have fun everyone
(even if you're in lockdown like me!)

Hamish

On Wed, Apr 1, 2020 at 7:47 AM Hamish Whittal <hamish@cloud-fundis.co.za>
wrote:

> Hi folks,
>
> 1) First Problem:
> I'm querying MySQL. I submit a query like this:
>
> out = wam.select('message_id', 'business_id', 'info',
> 'entered_system_date', 'auto_update_time').filter("auto_update_time >=
> '2020-04-01 05:27'").dropDuplicates(['message_id', 'auto_update_time'])
>
> But what I see in the DB is this:
>
> SELECT
> `message_id`,`business_id`,`info`,`entered_system_date`,`auto_update_time`
> FROM message WHERE (`auto_update_time` IS NOT NULL)
>
> Of course the IS NOT NULL is causing a scan of the whole table - SLOOOOW
>
> If instead I load straight from MySQL using a query rather than dbtable:
>
> myQuery = 'select business_id, entered_system_date, message_id, info,
> auto_update_time from message where auto_update_time >= "2020-04-01 05:27"'
>
> jdbcUrl = "jdbc:mysql://172.xx.xx.xx/prod"
>
> wam = spark\
>     .read\
>     .format("jdbc")\
>     .option("url", jdbcUrl)\
>     .option("query", myQuery)\        <------------------------ difference here
>     .option("user", connection['scriptParams']['user'])\
>     .option("password", connection['scriptParams']['pass'])\
>     .option("numPartitions",10)\
>     .load()
>
> then it's fast as expected.
>
> Question: Why is the query adding the IS NOT NULL - and it does this to
> every single query I perform if I use dbtable instead of query in the
> load/read?
>
> 2) Second Problem:
> Additionally, if I stop this query (I'm using Zeppelin), the query on the
> DB side continues running for a long while (read 40-60s) after I stop it.
>
> Question: Why is that? I mean, it should stop within a second or three -
> not 60!
>
> Any advice would be welcome
>
> MySQL JDBC driver version: 8.0.19
> MySQL server version: 5.7.23
> Spark version: 2.4.4
>
> Could it be a different version of the JDBC driver? I've tried 5.1.48 too.
> Same effect.
>
> Thanks in advance,
> Hamish
> --
> Cloud-Fundis.co.za
> Cape Town, South Africa
> +27 79 614 4913
>


-- 
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913
