manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: Too many slow queries caused by MCF running MySQL 5.5
Date Tue, 11 Dec 2012 09:45:28 GMT
Sorry, My bad.

jcifs.jar was missing. Probably this is the cause.

Sorry


Shigeki


2012/12/11 Shigeki Kobayashi <shigeki.kobayashi3@g.softbank.co.jp>

> Hi Karl.
>
> I could build the source ok but the following code is missing
> from connectors.xml. Does this mean I built it incorrectly or this is on
> purpose?
> Do I have to just add the code to enable the Windows share connection?
>
>
>   <repositoryconnector name="Windows shares"
> class="org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector"/>
>
>
> Regards,
>
> Shigeki
>
>
> 2012/12/11 Karl Wright <daddywri@gmail.com>
>
>> Hi Shigeki,
>>
>> I'm uploading a new version of ManifoldCF 1.1-dev, which you can pick
>> up at http://people.apache.org/~kwright/apache-manifoldcf-1.1-dev .
>> This has a good chance of fixing the query performance problem.
>> Please try it out, and let me know if you still get slow queries in
>> the log.  You should be to use the existing database instance.
>>
>> Thanks,
>> Karl
>>
>> On Mon, Dec 10, 2012 at 5:05 PM, Karl Wright <daddywri@gmail.com> wrote:
>> > Experiments here indicate that FORCE INDEX seems to do what we need.
>> >
>> > I'm going to think about it a bit and then come up with a fix that
>> > should use FORCE INDEX in this situation.  Then we can see if it
>> > actually helps for you.
>> >
>> > Karl
>> >
>> >
>> > On Mon, Dec 10, 2012 at 8:01 AM, Karl Wright <daddywri@gmail.com>
>> wrote:
>> >> Sorry, the FORCE INDEX hint requires the name of the index.  Since
>> >> ManifoldCF does not assign index names to fixed values, you will need
>> >> to find the right one, by using the SHOW INDEX command first to get
>> >> the right index's name.
>> >>
>> >> Apologies,
>> >> Karl
>> >>
>> >>
>> >> On Mon, Dec 10, 2012 at 6:41 AM, Karl Wright <daddywri@gmail.com>
>> wrote:
>> >>> Ok, that is unfortunate.  I will do some further MySQL research here.
>> >>> There is a FORCE INDEX MySQL construct that may help, e.g.
>> >>>
>> >>> SELECT ... FROM ... FORCE INDEX (key1_key2_key3) WHERE ...
>> >>>
>> >>> which we can also try.  In this case that would be: FORCE INDEX
>> >>> (docpriority,status,checkaction,checktime) or FORCE INDEX
>> >>> (docpriority_status_checkaction_checktime)  - unclear what the right
>> >>> syntax actually is.  Maybe you can try an explain with that in the
>> >>> query?
>> >>>
>> >>> FWIW, PostgreSQL should always use the index for this situation.
>> >>>
>> >>> Karl
>> >>>
>> >>>
>> >>> On Mon, Dec 10, 2012 at 5:27 AM, Shigeki Kobayashi
>> >>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>>> Hi Karl,
>> >>>>
>> >>>> Thanks for the reply.
>> >>>>
>> >>>> I did EXPLAIN as following:
>> >>>>
>> >>>> mysql> explain SELECT
>> >>>>     ->
>> >>>> t0.id
>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> >>>>     -> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status
IN
>> ('P','G')
>> >>>>     -> AND t0.checkaction='R' AND
>> >>>>     -> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM
jobs
>> t1 WHERE
>> >>>>     -> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5)
>> AND NOT
>> >>>>     -> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>> t2.dochash=t0.dochash AND
>> >>>> t2.status
>> >>>>     -> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND
NOT
>> >>>> EXISTS(SELECT
>> >>>>     -> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner
AND
>> >>>>     -> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>> >>>>     -> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
>> >>>>
>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
>> >>>> | id | select_type        | table | type   | possible_keys
>> >>>> | key            | key_len | ref                     | rows   |
Extra
>> >>>> |
>> >>>>
>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
>> >>>> |  1 | PRIMARY            | t0    | range  |
>> >>>> I1354241297073,I1354241297072,I1354241297071 | I1354241297071 |
25
>>    |
>> >>>> NULL                    | 151494 | Using where; Using filesort |
>> >>>> |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1354241297077
>> >>>> | I1354241297077 | 8       | manifoldcf.t0.id        |      1 |
>> >>>> |
>> >>>> |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY
>> >>>> | PRIMARY        | 767     | manifoldcf.t3.eventname |      1 |
>> Using index
>> >>>> |
>> >>>> |  3 | DEPENDENT SUBQUERY | t2    | ref    |
>> >>>> I1354241297070,I1354241297073,I1354241297072 | I1354241297070 |
122
>>     |
>> >>>> manifoldcf.t0.dochash   |      1 | Using where                 |
>> >>>> |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1354241297080
>> >>>> | PRIMARY        | 8       | manifoldcf.t0.jobid     |      1 |
>> Using where
>> >>>> |
>> >>>>
>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
>> >>>>
>> >>>>
>> >>>> As you see "Using filesort", I do not think it uses the index.
>> >>>>
>> >>>> By the way, which database do you recommend for the case of crawling
>>  a
>> >>>> humongous number of files for now? PostgreSQL?
>> >>>>
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> Shigeki
>> >>>>
>> >>>> 2012/12/10 Karl Wright <daddywri@gmail.com>
>> >>>>>
>> >>>>> Since you have a large table, can you try an EXPLAIN for the
>> following
>> >>>>> query, which should match the explanation given here:
>> >>>>> http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html
?
>> >>>>> Does it use the index?
>> >>>>>
>> >>>>> SELECT
>> >>>>>
>> >>>>> t0.id
>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> >>>>> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status
IN
>> ('P','G')
>> >>>>> AND t0.checkaction='R' AND
>> >>>>> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs
t1 WHERE
>> >>>>> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5)
AND
>> NOT
>> >>>>> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash
AND
>> >>>>> t2.status
>> >>>>> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
>> EXISTS(SELECT
>> >>>>> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>> >>>>> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>> >>>>> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Karl
>> >>>>>
>> >>>>> On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright <daddywri@gmail.com>
>> wrote:
>> >>>>> > Hi Shigeki,
>> >>>>> >
>> >>>>> > The rules for when a database will use an index for an
ORDER BY
>> clause
>> >>>>> > differ significantly from database to database.  The current
logic
>> >>>>> > seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly
not
>> MySQL.
>> >>>>> >  I will see if I can find a solution.  The ticket for this
>> >>>>> > CONNECTORS-584.
>> >>>>> >
>> >>>>> > Karl
>> >>>>> >
>> >>>>> > On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
>> >>>>> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>>>> >>
>> >>>>> >> Hi.
>> >>>>> >>
>> >>>>> >>
>> >>>>> >> I downloaded MCF1.1dev on Nov, 29th, and ran it using
MySQL
>> >>>>> >> I tried to crawl 10 million files using Windows share
connection
>> and
>> >>>>> >> index
>> >>>>> >> them into Solr.
>> >>>>> >>
>> >>>>> >> As MCF reached over 1 million files, the crawling speed
started
>> getting
>> >>>>> >> slower.
>> >>>>> >> So I checked slow queries and found out that too many
slow
>> queries
>> >>>>> >> occurred,
>> >>>>> >> especially the following kinds:
>> >>>>> >>
>> >>>>> >> --------------------------------------------------------
>> >>>>> >> # Time: 121204 16:25:40
>> >>>>> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>> >>>>> >> # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent:
1200
>> >>>>> >> Rows_examined:
>> >>>>> >> 611091
>> >>>>> >> SET timestamp=1354605940;
>> >>>>> >> SELECT
>> >>>>> >>
>> >>>>> >> t0.id
>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> >>>>> >> FROM jobqueue t0 WHERE t0.status IN ('P','G') AND
>> t0.checkaction='R'
>> >>>>> >> AND
>> >>>>> >> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x'
FROM jobs t1
>> WHERE
>> >>>>> >> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5)
>> AND NOT
>> >>>>> >> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash
>> AND
>> >>>>> >> t2.status
>> >>>>> >> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid)
AND NOT
>> >>>>> >> EXISTS(SELECT
>> >>>>> >> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner
AND
>> >>>>> >> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
>> >>>>> >> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
>> >>>>> >> # Time: 121204 16:25:44
>> >>>>> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>> >>>>> >> # Query_time: 3.064339  Lock_time: 0.000084 Rows_sent:
1
>> >>>>> >> Rows_examined:
>> >>>>> >> 406359
>> >>>>> >> SET timestamp=1354605944;
>> >>>>> >> SELECT docpriority,jobid,dochash,docid FROM jobqueue
t0 WHERE
>> status IN
>> >>>>> >> ('P','G') AND checkaction='R' AND checktime<=1354605932817
AND
>> >>>>> >> EXISTS(SELECT
>> >>>>> >> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id
>> =t0.jobid)
>> >>>>> >> ORDER BY
>> >>>>> >> docpriority ASC,status ASC,checkaction ASC,checktime
ASC LIMIT 1;
>> >>>>> >> -------------------------------------------------------
>> >>>>> >>
>> >>>>> >> I wonder if the queries appropriately use index of
the table.
>> >>>>> >> As a result of EXPLAIN against the slow query, there
was
>> filesort.
>> >>>>> >> There seems to be some conditions that MySQL does not
use index
>> >>>>> >> depending on
>> >>>>> >> ORDER BY:
>> >>>>> >>  - Executing ORDER BY against multiple keys
>> >>>>> >>  - When keys selected from records are different from
keys used
>> by
>> >>>>> >> ORDER BY
>> >>>>> >>
>> >>>>> >> Since filesort was happening, fully scanning records
should be
>> having
>> >>>>> >> MCF
>> >>>>> >> slower.
>> >>>>> >>
>> >>>>> >> Do you think this could happen even in PostgreSQL or
HSQLDB?
>> >>>>> >> Do you think queries could be modified to use index
>> appropriately?
>> >>>>> >>
>> >>>>> >>
>> >>>>> >> Regards,
>> >>>>> >>
>> >>>>> >> Shigeki
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>>
>
>
>
>
>
>

Mime
View raw message