manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Too many slow queries caused by MCF running MySQL 5.5
Date Tue, 11 Dec 2012 01:44:38 GMT
Hi Shigeki,

I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
This has a good chance of fixing the query performance problem.
Please try it out, and let me know if you still get slow queries in
the log.  You should be to use the existing database instance.

Thanks,
Karl

On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright <daddywri@gmail.com> wrote:
> Experiments here indicate that FORCE INDEX seems to do what we need.
>
> I'm going to think about it a bit and then come up with a fix that
> should use FORCE INDEX in this situation.  Then we can see if it
> actually helps for you.
>
> Karl
>
>
> On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright <daddywri@gmail.com> wrote:
>> Sorry, the FORCE INDEX hint requires the name of the index.  Since
>> ManifoldCF does not assign index names to fixed values, you will need
>> to find the right one, by using the SHOW INDEX command first to get
>> the right index's name.
>>
>> Apologies,
>> Karl
>>
>>
>> On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright <daddywri@gmail.com> wrote:
>>> Ok, that is unfortunate.  I will do some further MySQL research here.
>>> There is a FORCE INDEX MySQL construct that may help, e.g.
>>>
>>> SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...
>>>
>>> which we can also try.  In this case that would be: FORCE INDEX
>>> (docpriority,status,checkaction,checktime) or FORCE INDEX
>>> (docpriority_status_checkaction_checktime)  - unclear what the right
>>> syntax actually is.  Maybe you can try an explain with that in the
>>> query?
>>>
>>> FWIW, PostgreSQL should always use the index for this situation.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
>>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>>>> Hi Karl,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> I did EXPLAIN as following:
>>>>
>>>> mysql> explain SELECT
>>>>     ->
>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>>>     -> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status IN
('P','G')
>>>>     -> AND t0.checkaction='R' AND
>>>>     -> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs
t1 WHERE
>>>>     -> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND
NOT
>>>>     -> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash
AND
>>>> t2.status
>>>>     -> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
>>>> EXISTS(SELECT
>>>>     -> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>>>>     -> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>>>>     -> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
>>>> | id | select_type        | table | type   | possible_keys
>>>> | key            | key_len | ref                     | rows   | Extra
>>>> |
>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
>>>> |  1 | PRIMARY            | t0    | range  |
>>>> I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25      |
>>>> NULL                    | 151494 | Using where; Using filesort |
>>>> |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1354241297077
>>>> | I1354241297077 | 8       | manifoldcf.t0.id        |      1 |
>>>> |
>>>> |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY
>>>> | PRIMARY        | 767     | manifoldcf.t3.eventname |      1 | Using index
>>>> |
>>>> |  3 | DEPENDENT SUBQUERY | t2    | ref    |
>>>> I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122     |
>>>> manifoldcf.t0.dochash   |      1 | Using where                 |
>>>> |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1354241297080
>>>> | PRIMARY        | 8       | manifoldcf.t0.jobid     |      1 | Using where
>>>> |
>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
>>>>
>>>>
>>>> As you see "Using filesort", I do not think it uses the index.
>>>>
>>>> By the way, which database do you recommend for the case of crawling  a
>>>> humongous number of files for now? PostgreSQL?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Shigeki
>>>>
>>>> 2012/12/10 Karl Wright <daddywri@gmail.com>
>>>>>
>>>>> Since you have a large table, can you try an EXPLAIN for the following
>>>>> query, which should match the explanation given here:
>>>>> http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
>>>>> Does it use the index?
>>>>>
>>>>> SELECT
>>>>>
>>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>>>> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status IN ('P','G')
>>>>> AND t0.checkaction='R' AND
>>>>> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>>>>> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
>>>>> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
>>>>> t2.status
>>>>> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
>>>>> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>>>>> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>>>>> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200
>>>>>
>>>>> Thanks!
>>>>> Karl
>>>>>
>>>>> On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright <daddywri@gmail.com>
wrote:
>>>>> > Hi Shigeki,
>>>>> >
>>>>> > The rules for when a database will use an index for an ORDER BY
clause
>>>>> > differ significantly from database to database.  The current logic
>>>>> > seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not
MySQL.
>>>>> >  I will see if I can find a solution.  The ticket for this
>>>>> > CONNECTORS-584.
>>>>> >
>>>>> > Karl
>>>>> >
>>>>> > On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
>>>>> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>>>>> >>
>>>>> >> Hi.
>>>>> >>
>>>>> >>
>>>>> >> I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
>>>>> >> I tried to crawl 10 million files using Windows share connection
and
>>>>> >> index
>>>>> >> them into Solr.
>>>>> >>
>>>>> >> As MCF reached over 1 million files, the crawling speed started
getting
>>>>> >> slower.
>>>>> >> So I checked slow queries and found out that too many slow queries
>>>>> >> occurred,
>>>>> >> especially the following kinds:
>>>>> >>
>>>>> >> --------------------------------------------------------
>>>>> >> # Time: 121204 16:25:40
>>>>> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>>>> >> # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200
>>>>> >> Rows_examined:
>>>>> >> 611091
>>>>> >> SET timestamp=1354605940;
>>>>> >> SELECT
>>>>> >>
>>>>> >> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>>>> >> FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R'
>>>>> >> AND
>>>>> >> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs
t1 WHERE
>>>>> >> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5)
AND NOT
>>>>> >> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash
AND
>>>>> >> t2.status
>>>>> >> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
>>>>> >> EXISTS(SELECT
>>>>> >> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>>>>> >> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>>>>> >> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
>>>>> >> # Time: 121204 16:25:44
>>>>> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>>>> >> # Query_time: 3.064339  Lock_time: 0.000084 Rows_sent: 1
>>>>> >> Rows_examined:
>>>>> >> 406359
>>>>> >> SET timestamp=1354605944;
>>>>> >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE
status IN
>>>>> >> ('P','G') AND checkaction='R' AND checktime<=1354605932817
AND
>>>>> >> EXISTS(SELECT
>>>>> >> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)
>>>>> >> ORDER BY
>>>>> >> docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT
1;
>>>>> >> -------------------------------------------------------
>>>>> >>
>>>>> >> I wonder if the queries appropriately use index of the table.
>>>>> >> As a result of EXPLAIN against the slow query, there was filesort.
>>>>> >> There seems to be some conditions that MySQL does not use index
>>>>> >> depending on
>>>>> >> ORDER BY:
>>>>> >>  - Executing ORDER BY against multiple keys
>>>>> >>  - When keys selected from records are different from keys used
by
>>>>> >> ORDER BY
>>>>> >>
>>>>> >> Since filesort was happening, fully scanning records should
be having
>>>>> >> MCF
>>>>> >> slower.
>>>>> >>
>>>>> >> Do you think this could happen even in PostgreSQL or HSQLDB?
>>>>> >> Do you think queries could be modified to use index appropriately?
>>>>> >>
>>>>> >>
>>>>> >> Regards,
>>>>> >>
>>>>> >> Shigeki
>>>>
>>>>
>>>>
>>>>
>>>>

Mime
View raw message