manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Too many slow queries caused by MCF running MySQL 5.5
Date Mon, 10 Dec 2012 11:41:24 GMT
Ok, that is unfortunate.  I will do some further MySQL research here.
There is a FORCE INDEX MySQL construct that may help, e.g.

SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...

which we can also try.  In this case that would be: FORCE INDEX
(docpriority,status,checkaction,checktime) or FORCE INDEX
(docpriority_status_checkaction_checktime)  - unclear what the right
syntax actually is.  Maybe you can try an explain with that in the
query?

FWIW, PostgreSQL should always use the index for this situation.

Karl


On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
<shigeki.kobayashi3@g.softbank.co.jp> wrote:
> Hi Karl,
>
> Thanks for the reply.
>
> I did EXPLAIN as following:
>
> mysql> explain SELECT
>     ->
> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>     -> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status IN ('P','G')
>     -> AND t0.checkaction='R' AND
>     -> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>     -> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
>     -> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
> t2.status
>     -> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
> EXISTS(SELECT
>     -> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>     -> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>     -> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
> | id | select_type        | table | type   | possible_keys
> | key            | key_len | ref                     | rows   | Extra
> |
> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
> |  1 | PRIMARY            | t0    | range  |
> I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25      |
> NULL                    | 151494 | Using where; Using filesort |
> |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1354241297077
> | I1354241297077 | 8       | manifoldcf.t0.id        |      1 |
> |
> |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY
> | PRIMARY        | 767     | manifoldcf.t3.eventname |      1 | Using index
> |
> |  3 | DEPENDENT SUBQUERY | t2    | ref    |
> I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122     |
> manifoldcf.t0.dochash   |      1 | Using where                 |
> |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1354241297080
> | PRIMARY        | 8       | manifoldcf.t0.jobid     |      1 | Using where
> |
> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
>
>
> As you see "Using filesort", I do not think it uses the index.
>
> By the way, which database do you recommend for the case of crawling  a
> humongous number of files for now? PostgreSQL?
>
>
> Regards,
>
> Shigeki
>
> 2012/12/10 Karl Wright <daddywri@gmail.com>
>>
>> Since you have a large table, can you try an EXPLAIN for the following
>> query, which should match the explanation given here:
>> http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
>> Does it use the index?
>>
>> SELECT
>>
>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status IN ('P','G')
>> AND t0.checkaction='R' AND
>> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
>> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
>> t2.status
>> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
>> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200
>>
>> Thanks!
>> Karl
>>
>> On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright <daddywri@gmail.com> wrote:
>> > Hi Shigeki,
>> >
>> > The rules for when a database will use an index for an ORDER BY clause
>> > differ significantly from database to database.  The current logic
>> > seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
>> >  I will see if I can find a solution.  The ticket for this
>> > CONNECTORS-584.
>> >
>> > Karl
>> >
>> > On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
>> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>
>> >> Hi.
>> >>
>> >>
>> >> I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
>> >> I tried to crawl 10 million files using Windows share connection and
>> >> index
>> >> them into Solr.
>> >>
>> >> As MCF reached over 1 million files, the crawling speed started getting
>> >> slower.
>> >> So I checked slow queries and found out that too many slow queries
>> >> occurred,
>> >> especially the following kinds:
>> >>
>> >> --------------------------------------------------------
>> >> # Time: 121204 16:25:40
>> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>> >> # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200
>> >> Rows_examined:
>> >> 611091
>> >> SET timestamp=1354605940;
>> >> SELECT
>> >>
>> >> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> >> FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R'
>> >> AND
>> >> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>> >> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
>> >> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
>> >> t2.status
>> >> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
>> >> EXISTS(SELECT
>> >> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>> >> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>> >> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
>> >> # Time: 121204 16:25:44
>> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>> >> # Query_time: 3.064339  Lock_time: 0.000084 Rows_sent: 1
>> >> Rows_examined:
>> >> 406359
>> >> SET timestamp=1354605944;
>> >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN
>> >> ('P','G') AND checkaction='R' AND checktime<=1354605932817 AND
>> >> EXISTS(SELECT
>> >> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)
>> >> ORDER BY
>> >> docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1;
>> >> -------------------------------------------------------
>> >>
>> >> I wonder if the queries appropriately use index of the table.
>> >> As a result of EXPLAIN against the slow query, there was filesort.
>> >> There seems to be some conditions that MySQL does not use index
>> >> depending on
>> >> ORDER BY:
>> >>  - Executing ORDER BY against multiple keys
>> >>  - When keys selected from records are different from keys used by
>> >> ORDER BY
>> >>
>> >> Since filesort was happening, fully scanning records should be having
>> >> MCF
>> >> slower.
>> >>
>> >> Do you think this could happen even in PostgreSQL or HSQLDB?
>> >> Do you think queries could be modified to use index appropriately?
>> >>
>> >>
>> >> Regards,
>> >>
>> >> Shigeki
>
>
>
>
>

Mime
View raw message