manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: Crawling new/updated files using Windows share connection takes too long
Date Mon, 21 Jan 2013 06:49:07 GMT
Can you get an EXPLAIN for this query? It sounds like it is
disregarding the hint for some reason.

Karl

Sent from my Windows Phone
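
A minimal sketch of how the requested plan could be captured, assuming access to the same MySQL 5.5 instance through the mysql client. The statement is the second slow query from the quoted message below, copied unchanged, with EXPLAIN prefixed:

```sql
-- EXPLAIN reports the plan the optimizer chooses for the statement.
-- The query text is taken from the slow query log in the quoted message.
EXPLAIN
SELECT docpriority,jobid,dochash,docid
FROM jobqueue t0 FORCE INDEX (i1358228295210)
WHERE status IN ('P','G')
  AND checkaction='R'
  AND checktime<=1358649661663
  AND EXISTS(SELECT 'x' FROM jobs t1
             WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)
ORDER BY docpriority ASC
LIMIT 1;
```

In the output, the key column for the jobqueue row shows which index was actually chosen, and the rows column shows the optimizer's estimate of how many rows it expects to examine. If key does not show i1358228295210, the hint is not being applied.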
From: Shigeki Kobayashi
Sent: 1/20/2013 9:37 PM
To: user@manifoldcf.apache.org
Subject: Re: Crawling new/updated files using Windows share connection takes too long
Hi Karl.

I configured MCF to run on MySQL 5.5 this time.
The MCF version is trunk 1.1dev, downloaded on December 12th, which includes
your fix for the slow query using "FORCE INDEX". Solr is 4.0.

I thought it was fixed, but the log shows that the following queries are
still slow.
-------------------------------------------------------------------
# Time: 130120 11:41:10
# User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
# Query_time: 8.761087  Lock_time: 0.000163 Rows_sent: 17  Rows_examined: 6365233
SET timestamp=1358649670;
SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G')
AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT
'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events
t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority
ASC LIMIT 4800;

# Time: 130120 11:41:18
# User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
# Query_time: 7.714277  Lock_time: 0.000123 Rows_sent: 0  Rows_examined: 6365182
SET timestamp=1358649678;
SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX
(i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND
checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status
IN ('A','a') AND t1.id=t0.jobid)  ORDER BY docpriority ASC LIMIT 1;
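
Both statements name the index i1358228295210 in the hint, but its column list does not appear in this thread. As a sketch, assuming the same MySQL 5.5 instance, the index definition could be checked like this:

```sql
-- List every index on jobqueue, one row per indexed column.
-- Look for the rows whose Key_name is i1358228295210; Seq_in_index gives
-- the column order and Column_name the column at that position.
SHOW INDEX FROM jobqueue;
```

Comparing that column order with the WHERE and ORDER BY clauses above may help explain why more than six million rows are examined to return at most 17.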

Regards,


Shigeki



2013/1/18 Karl Wright <daddywri@gmail.com>

> Hi Shigeki,
>
> What database is ManifoldCF configured to use in this case?  Do you
> see any indication of slow queries in the ManifoldCF log?
>
>
> Karl
>
> On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> > Hello
> >
> >
> > I would like some advice on reducing the crawl time for new/updated files
> > when using a Windows share connection.
> >
> > I crawl files on a Windows server and index them into Solr.
> >
> > Currently, the second crawl of two hundred thousand files takes over 5
> > hours, even though no files have been updated, created, or deleted.
> >
> > I assume MCF performs the following steps (let me know if I am wrong):
> >
> > - obtain the last-updated time of a file
> > - compare it with the last-updated time MCF recorded during the previous
> >   crawl (probably stored in the DB)
> > - if they differ, MCF marks the file to be indexed.
> >
> > If the above steps are done for two thousand files, which step is likely
> > to take the most time? Obtaining the last-updated time? Reading data
> > from the DB? What do you think could be done to reduce the crawl time?
> >
> > Please give me some advice.
> >
> >
> > Regards,
> >
> > Shigeki
> >
> >
>

