manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: Crawling new/updated files using Windows share connection takes too long
Date Mon, 21 Jan 2013 02:36:28 GMT
Hi Karl.

I configured MySQL 5.5 to run MCF this time.
The version of MCF is trunk 1.1dev, downloaded on Dec. 12th, which includes
your fix for the slow query using "FORCE INDEX". Solr is 4.0.

I thought it was fixed, but the log shows that the following queries are
slow.
-------------------------------------------------------------------
# Time: 130120 11:41:10
# User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
# Query_time: 8.761087  Lock_time: 0.000163 Rows_sent: 17  Rows_examined: 6365233
SET timestamp=1358649670;
SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G')
AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT
'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events
t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority
ASC LIMIT 4800;

# Time: 130120 11:41:18
# User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
# Query_time: 7.714277  Lock_time: 0.000123 Rows_sent: 0  Rows_examined: 6365182
SET timestamp=1358649678;
SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX
(i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND
checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status
IN ('A','a') AND t1.id=t0.jobid)  ORDER BY docpriority ASC LIMIT 1;
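
To put a number on how much scanning each query does, here is a minimal
sketch in Python (not part of MCF) that parses the statistics header of a
slow-log entry and reports rows examined per row sent. The function names and
the regular expression are my own, for illustration only; the input lines are
the two headers from the log above. A very large ratio, as in the first
query, suggests the forced index is not narrowing the scan much.

```python
import re

# Matches the statistics header that precedes each query in the slow log.
# Assumes the header is on a single line (mail clients may wrap it).
HEADER = re.compile(
    r"#\s*Query_time:\s*(?P<query_time>[\d.]+)\s+"
    r"Lock_time:\s*(?P<lock_time>[\d.]+)\s+"
    r"Rows_sent:\s*(?P<rows_sent>\d+)\s+"
    r"Rows_examined:\s*(?P<rows_examined>\d+)"
)


def parse_slow_log_header(line):
    """Return the four statistics from one slow-log header line as a dict."""
    match = HEADER.search(line)
    if match is None:
        raise ValueError("not a slow-log statistics header: %r" % line)
    return {
        "query_time": float(match.group("query_time")),
        "lock_time": float(match.group("lock_time")),
        "rows_sent": int(match.group("rows_sent")),
        "rows_examined": int(match.group("rows_examined")),
    }


def rows_examined_per_row_sent(stats):
    """Return rows_examined / rows_sent, or None when no rows were sent."""
    if stats["rows_sent"] == 0:
        return None
    return stats["rows_examined"] / stats["rows_sent"]


if __name__ == "__main__":
    headers = [
        "# Query_time: 8.761087  Lock_time: 0.000163 Rows_sent: 17  Rows_examined: 6365233",
        "# Query_time: 7.714277  Lock_time: 0.000123 Rows_sent: 0  Rows_examined: 6365182",
    ]
    for header in headers:
        stats = parse_slow_log_header(header)
        print(stats, rows_examined_per_row_sent(stats))
```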

Regards,


Shigeki



2013/1/18 Karl Wright <daddywri@gmail.com>

> Hi Shigeki,
>
> What database is ManifoldCF configured to use in this case?  Do you
> see any indication of slow queries in the ManifoldCF log?
>
>
> Karl
>
> On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> > Hello
> >
> >
> > I would like some advice on improving the crawling time of new/updated
> > files using a Windows share connection.
> >
> > I crawl files on a Windows server and index them into Solr.
> >
> > Currently, the second crawl of two hundred thousand files takes over 5
> > hours, even though no files have been updated, created, or deleted.
> >
> > I assume MCF does the following processes (let me know if I am wrong)
> >
> > - obtain the updated time of a file
> > - compare the updated time with the one MCF obtained during the last
> > crawl (probably stored in the DB)
> > - if they are different, MCF recognizes that the file is to be indexed.
> >
> > If the above processes are done for two thousand files, which part of the
> > processes could take the most time? Obtaining the updated time? Reading
> > data from the DB? What do you think could be done to reduce the crawling
> > time?
> >
> > Please give me some advice.
> >
> >
> > Regards,
> >
> > Shigeki
> >
> >
>
