manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Fwd: Crawling new/updated files using Windows share connection
Date Mon, 21 Jan 2013 13:48:47 GMT
---------- Forwarded message ----------
From: Karl Wright <daddywri@gmail.com>
Date: Mon, Jan 21, 2013 at 8:48 AM
Subject: Re: Crawling new/updated files using Windows share connection
To: Shigeki Kobayashi <shigeki.kobayashi3@g.softbank.co.jp>


Hi Shigeki,

I reviewed the code in detail.  At the time CONNECTORS-290 was fixed,
all document priorities were set to null whenever a job was paused or
aborted, so what I suspected might be the problem cannot in fact
happen.

The most likely explanation for MySQL's behavior, therefore, is that
MySQL orders null docpriority values BEFORE all other rows in the
index it is using for queue stuffing.  I have no other way of
explaining why it thinks it needs to go through 6.5 million rows
before it gets to the ones that are active.

If this is the case, it may be possible to tell MySQL to order null
column values to the END instead of the beginning of the index.  I'll
do some research on this later and get back to you.

Thanks,
Karl
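
[Illustration] The null-ordering hypothesis above can be reproduced on a small scale. The following is only a sketch under stated assumptions: it uses SQLite through Python's standard sqlite3 module rather than MySQL (both sort NULL before non-NULL values in ascending order), the table is a made-up miniature of jobqueue, the status value 'C' and the index name are hypothetical, and the `IS NOT NULL` filter is one possible direction that this thread does not confirm as a fix.

```python
import sqlite3

# Hypothetical miniature of the jobqueue table. Column names follow the
# queries quoted in this thread; the rows themselves are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobqueue (id INTEGER PRIMARY KEY, status TEXT, docpriority REAL)"
)
conn.execute("CREATE INDEX idx_docpriority ON jobqueue (docpriority)")

# Five rows whose priority has been cleared (status 'C' is illustrative),
# followed by two rows that are waiting to be queued.
rows = [(i, "C", None) for i in range(1, 6)]
rows += [(6, "P", 2.5), (7, "P", 0.5)]
conn.executemany("INSERT INTO jobqueue VALUES (?, ?, ?)", rows)

# Ascending order places the NULL priorities first, so a scan in index
# order passes every NULL row before it reaches a row with a real priority.
ordered = conn.execute(
    "SELECT id, docpriority FROM jobqueue ORDER BY docpriority ASC"
).fetchall()

null_rows_first = 0
for _id, priority in ordered:
    if priority is not None:
        break
    null_rows_first += 1

# Possible direction (an assumption, not verified against MySQL): exclude
# NULL priorities in the WHERE clause so they never have to be considered.
first_active = conn.execute(
    "SELECT id FROM jobqueue "
    "WHERE status = 'P' AND docpriority IS NOT NULL "
    "ORDER BY docpriority ASC LIMIT 1"
).fetchone()

print(null_rows_first, first_active)
```

In this toy table all five NULL rows sort ahead of the two prioritized rows, which is the same shape as a queue query that examines millions of rows to return a handful. Whether MySQL can use the extra predicate to skip the NULL portion of the forced index would need to be checked with EXPLAIN on the real schema.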



On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright <daddywri@gmail.com> wrote:
> Are there any large paused or aborted jobs present on the same
> ManifoldCF?  If so, can you tell me whether the job is paused, or
> aborted?  (I am betting paused...)
>
> Karl
>
> On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> Hi Karl,
>>
>>
>> Here is the explain. There isn't such sort...
>>
>> mysql> explain SELECT
>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') AND
>> t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x'
>> FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4
>> WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC
>> LIMIT 4800;
>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>> | id | select_type        | table | type   | possible_keys                                | key            | key_len | ref                     | rows | Extra       |
>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>> |  1 | PRIMARY            | t0    | index  | NULL                                         | I1358228295210 | 25      | NULL                    | 4800 | Using where |
>> |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1358228295216                               | I1358228295216 | 8       | manifoldcf.t0.id        |    1 |             |
>> |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY                                      | PRIMARY        | 767     | manifoldcf.t3.eventname |    1 | Using index |
>> |  3 | DEPENDENT SUBQUERY | t2    | ref    | I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122     | manifoldcf.t0.dochash   |    1 | Using where |
>> |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1358228295219                       | PRIMARY        | 8       | manifoldcf.t0.jobid     |    1 | Using where |
>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>> 5 rows in set (0.00 sec)
>>
>>
>> Regards,
>>
>>
>> Shigeki
>>
>>
>>
>> 2013/1/21 Karl Wright <daddywri@gmail.com>
>>>
>>> Can you get an EXPLAIN for this query? It sounds like it is
>>> disregarding the hint for some reason.
>>>
>>> Karl
>>>
>>> Sent from my Windows Phone
>>> From: Shigeki Kobayashi
>>> Sent: 1/20/2013 9:37 PM
>>> To: user@manifoldcf.apache.org
>>> Subject: Re: Crawling new/updated files using Windows share connection
>>> takes too long
>>> Hi Karl.
>>>
>>> I configured MySQL 5.5 to run MCF this time.
>>> The version of MCF is trunk 1.1dev, downloaded on Dec 12th, in which you
>>> fixed the slow query using "FORCE INDEX". Solr is 4.0.
>>>
>>> I thought it was fixed, but the log shows that the following queries are
>>> slow.
>>> -------------------------------------------------------------------
>>> # Time: 130120 11:41:10
>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>> # Query_time: 8.761087  Lock_time: 0.000163 Rows_sent: 17  Rows_examined: 6365233
>>> SET timestamp=1358649670;
>>> SELECT
>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G')
>>> AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT
>>> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events
>>> t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority
>>> ASC LIMIT 4800;
>>>
>>> # Time: 130120 11:41:18
>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>> # Query_time: 7.714277  Lock_time: 0.000123 Rows_sent: 0  Rows_examined: 6365182
>>> SET timestamp=1358649678;
>>> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX
>>> (i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND
>>> checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>>> t1.status
>>> IN ('A','a') AND t1.id=t0.jobid)  ORDER BY docpriority ASC LIMIT 1;
>>>
>>> Regards,
>>>
>>>
>>> Shigeki
>>>
>>>
>>>
>>> 2013/1/18 Karl Wright <daddywri@gmail.com>
>>>
>>> > Hi Shigeki,
>>> >
>>> > What database is ManifoldCF configured to use in this case?  Do you
>>> > see any indication of slow queries in the ManifoldCF log?
>>> >
>>> >
>>> > Karl
>>> >
>>> > On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi
>>> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>>> > > Hello
>>> > >
>>> > >
>>> > > I would like some advice on reducing the crawl time for new/updated
>>> > > files using a Windows share connection.
>>> > >
>>> > > I crawl files on a Windows server and index them into Solr.
>>> > >
>>> > > Currently, the second crawl of two hundred thousand files takes over
>>> > > 5 hours, even though no files have been updated, created, or deleted.
>>> > >
>>> > > I assume MCF performs the following steps (let me know if I am wrong):
>>> > >
>>> > > - obtain the updated time of a file
>>> > > - compare the updated time with the one MCF obtained during the last
>>> > >   crawl (probably stored in the DB)
>>> > > - if they are different, MCF recognizes that the file is to be indexed.
>>> > >
>>> > > If the above steps are done for two thousand files, which part of
>>> > > the process could take the most time? Obtaining the updated time?
>>> > > Reading data from the DB? What could be done to reduce the crawling
>>> > > time, do you think?
>>> > >
>>> > > Please give me some advice.
>>> > >
>>> > >
>>> > > Regards,
>>> > >
>>> > > Shigeki
>>> > >
>>> > >
>>> >
>>>
>>
>>
>>
>>
>>
