manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Crawling new/updated files using Windows share connection
Date Mon, 21 Jan 2013 14:08:09 GMT
Bad news, I am afraid.  MySQL seems to always put null values at the
front of the index, and that cannot be changed through any means I can
find.  This is different from all other databases I know of.

The only possible fixes for the problem are as follows:

(1) Not use a null doc priority but instead use an actual special
number that is guaranteed to always sort to the end.  This would be
non-trivial because this column is a FLOAT value, and round-off errors
will prevent the ManifoldCF code from reliably using a special number
like that on all databases.

(2) Use docpriorities that are ordered in the opposite way - which
would work ONLY for MySQL and would break all other databases.

I'll create a ticket and think about the problem some more.

Karl

On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Shigeki,
>
> I reviewed the code in detail.  At the time CONNECTORS-290 was fixed,
> all document priorities were set to null whenever a job was paused or
> aborted, so what I suspected might be the problem cannot in fact
> happen.
>
> The most likely possible explanation for MySQL's behavior, therefore,
> is that MySQL orders null docpriority values BEFORE all other rows in
> the index it is using for queue stuffing.  I have no other way of
> explaining why it thinks it needs to go through 6.5 million rows
> before it gets to the ones that are active.
>
> If this is the case, it may be possible to tell MySQL to order null
> column values to the END instead of the beginning of the index.  I'll
> do some research on this later and get back to you.
>
> Thanks,
> Karl
>
>
>
> On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright <daddywri@gmail.com> wrote:
>> Are there any large paused or aborted jobs present on the same
>> ManifoldCF?  If so, can you tell me whether the job is paused, or
>> aborted?  (I am betting paused...)
>>
>> Karl
>>
>> On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi
>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>>> Hi Karl,
>>>
>>>
>>> Here is the explain. There isn't such sort...
>>>
>>> mysql> explain SELECT
>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0 IN ('P','G') AND
>>> t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x'
>>> FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4
>>> WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC
>>> LIMIT 4800;
>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>>> | id | select_type        | table | type   | possible_keys
>>> | key            | key_len | ref                     | rows | Extra       |
>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>>> |  1 | PRIMARY            | t0    | index  | NULL
>>> | I1358228295210 | 25      | NULL                    | 4800 | Using where |
>>> |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1358228295216
>>> | I1358228295216 | 8       | manifoldcf.t0.id        |    1 |             |
>>> |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY
>>> | PRIMARY        | 767     | manifoldcf.t3.eventname |    1 | Using index |
>>> |  3 | DEPENDENT SUBQUERY | t2    | ref    |
>>> I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122     |
>>> manifoldcf.t0.dochash   |    1 | Using where |
>>> |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1358228295219
>>> | PRIMARY        | 8       | manifoldcf.t0.jobid     |    1 | Using where |
>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>>> 5 rows in set (0.00 sec)
>>>
>>>
>>> Regards,
>>>
>>>
>>> Shigeki
>>>
>>>
>>>
>>> 2013/1/21 Karl Wright <daddywri@gmail.com>
>>>>
>>>>  takes too long
>>>> MIME-Version: 1.0
>>>> Content-Type: multipart/alternative; boundary=14dae9399bd9676e5704d3c356e9
>>>>
>>>> --14dae9399bd9676e5704d3c356e9
>>>> Content-Type: text/plain; charset="utf-8"
>>>> Content-Transfer-Encoding: 7bit
>>>>
>>>> Can you get an EXPLAIN for this query? It sounds like it is
>>>> disregarding the hint for some reason.
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> From: Shigeki Kobayashi
>>>> Sent: 1/20/2013 9:37 PM
>>>> To: user@manifoldcf.apache.org
>>>> Subject: Re: Crawling new/updated files using Windows share connection
>>>> takes too long
>>>> Hi Karl.
>>>>
>>>> I configured MySQL 5.5 to run MCF this time.
>>>> The version of MCF is trunk 1.1dev downloaded on Dec, 12th. , which you
>>>> fixed
>>>> the slow query using "FORCE INDEX". Solr is 4.0
>>>>
>>>> I thought is was fixed but the log shows that  the following are slow
>>>> queries.
>>>> -------------------------------------------------------------------
>>>> # Time: 130120 11:41:10
>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>>> # Query_time: 8.761087  Lock_time: 0.000163 Rows_sent: 17  Rows_examined:
>>>> 6365233
>>>> SET timestamp=1358649670;
>>>> SELECT
>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G')
>>>> AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT
>>>> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
>>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
>>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events
>>>> t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority
>>>> ASC LIMIT 4800;
>>>>
>>>> # Time: 130120 11:41:18
>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>>> # Query_time: 7.714277  Lock_time: 0.000123 Rows_sent: 0  Rows_examined:
>>>> 6365182
>>>> SET timestamp=1358649678;
>>>> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX
>>>> (i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND
>>>> checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>>>> t1.status
>>>> IN ('A','a') AND t1.id=t0.jobid)  ORDER BY docpriority ASC LIMIT 1;
>>>>
>>>> Regards,
>>>>
>>>>
>>>> Shigeki
>>>>
>>>>
>>>>
>>>> 2013/1/18 Karl Wright <daddywri@gmail.com>
>>>>
>>>> > Hi Shigeki,
>>>> >
>>>> > What database is ManifoldCF configured to use in this case?  Do you
>>>> > see any indication of slow queries in the ManifoldCF log?
>>>> >
>>>> >
>>>> > Karl
>>>> >
>>>> > On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi
>>>> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>>>> > > Hello
>>>> > >
>>>> > >
>>>> > > I would like some advice to improve crawling time of new/updated
files
>>>> > using
>>>> > > Windows share connection.
>>>> > >
>>>> > > I crawl file in Windows server and index them into Solr.
>>>> > >
>>>> > > Currently, the second crawling of two hundred thousands files takes
>>>> >  over 5
>>>> > > hours, even though any files are not updated, created, deleted.
>>>> > >
>>>> > > I assume MCF does the following processes (let me know if I am
wrong)
>>>> > >
>>>> > > - obtain updated time of a file
>>>> > > - compare the updated time with the one MCF obtained last time
>>>> > > crawling(
>>>> > > probably stored in DB)
>>>> > > - if they are different MCF recognizes the file is to be indexed.
>>>> > >
>>>> > > If the above processes are done for two thousands files, what part
of
>>>> > > the
>>>> > > processes could take time the most? obtaining updated time? reading
>>>> > > data
>>>> > > from DB? what could be done to increase the crawling time do you
>>>> > > think?
>>>> > >
>>>> > > Please give me some advice.
>>>> > >
>>>> > >
>>>> > > Regards,
>>>> > >
>>>> > > Shigeki
>>>> > >
>>>> > >
>>>> >
>>>>
>>>> --14dae9399bd9676e5704d3c356e9
>>>> Content-Type: text/html; charset="utf-8"
>>>> Content-Transfer-Encoding: quoted-printable
>>>>
>>>> <html><head><meta content=3D"text/html; charset=3Dutf-8"
>>>> http-equiv=3D"Cont=
>>>> ent-Type"></head><body><div><div style=3D"font-family:
Calibri,sans-serif;
>>>> =
>>>> font-size: 11pt;"><br>Can you get an EXPLAIN for this query?&nbsp;
It
>>>> sound=
>>>> s like it is disregarding the hint for some
>>>> reason.<br><br>Karl<br><br>Sent=
>>>>  from my Windows Phone<br></div></div><hr><span
style=3D"font-family:
>>>> Tahom=
>>>> a,sans-serif; font-size: 10pt; font-weight: bold;">From: </span><span
>>>> style=
>>>> =3D"font-family: Tahoma,sans-serif; font-size: 10pt;">Shigeki
>>>> Kobayashi</sp=
>>>> an><br><span style=3D"font-family: Tahoma,sans-serif; font-size:
10pt;
>>>> font=
>>>> -weight: bold;">Sent: </span><span style=3D"font-family:
>>>> Tahoma,sans-serif;=
>>>>  font-size: 10pt;">1/20/2013 9:37 PM</span><br><span style=3D"font-family:
>>>> =
>>>> Tahoma,sans-serif; font-size: 10pt; font-weight: bold;">To: </span><span
>>>> st=
>>>> yle=3D"font-family: Tahoma,sans-serif; font-size:
>>>> 10pt;">user@manifoldcf.ap=
>>>> ache.org</span><br><span style=3D"font-family: Tahoma,sans-serif;
>>>> font-size=
>>>> : 10pt; font-weight: bold;">Subject: </span><span style=3D"font-family:
>>>> Tah=
>>>> oma,sans-serif; font-size: 10pt;">Re: Crawling new/updated files using
>>>> Wind=
>>>> ows share connection takes too long</span><br><br></body></html><div
>>>> dir=3D=
>>>> "ltr">Hi Karl.<div><br></div><div>I configured
MySQL 5.5 to run MCF this
>>>> ti=
>>>> me.</div><div>The version of MCF is trunk 1.1dev downloaded on
Dec, 12th.
>>>> ,=
>>>>  which you fixed</div><div style>the slow query using &quot;FORCE
>>>> INDEX&quo=
>>>> t;. Solr is 4.0</div>
>>>>
>>>> <div style><br></div><div style>I thought is was
fixed but the log shows
>>>> th=
>>>> at =C2=A0the following are slow queries.=C2=A0</div><div><div
>>>> class=3D"gmai=
>>>>
>>>> l_extra">------------------------------------------------------------------=
>>>> -</div>
>>>> <div class=3D"gmail_extra">
>>>> <div class=3D"gmail_extra"># Time: 130120 11:41:10</div><div
>>>> class=3D"gmail=
>>>> _extra"># User@Host: manifoldcf[manifoldcf] @ localhost
>>>> [127.0.0.1]</div><d=
>>>> iv class=3D"gmail_extra"># Query_time: 8.761087 =C2=A0Lock_time: 0.000163
>>>> R=
>>>> ows_sent: 17 =C2=A0Rows_examined: 6365233</div>
>>>>
>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649670;</div><div
>>>> class=3D"g=
>>>> mail_extra">SELECT <a
>>>> href=3D"http://t0.id">t0.id</a>,t0.jobid,t0.dochash,t=
>>>> 0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0
>>>> =
>>>> FORCE INDEX (i1358228295210) WHERE t0.status IN (&#39;P&#39;,&#39;G&#39;)
>>>> A=
>>>> ND t0.checkaction=3D&#39;R&#39; AND t0.checktime&lt;=3D1358649661663
AND
>>>> EX=
>>>> ISTS(SELECT &#39;x&#39; FROM jobs t1 WHERE t1.status IN
>>>> (&#39;A&#39;,&#39;a=
>>>> &#39;) AND <a href=3D"http://t1.id">t1.id</a>=3Dt0.jobid
AND
>>>> t1.priority=3D=
>>>> 5) AND NOT EXISTS(SELECT &#39;x&#39; FROM jobqueue t2 WHERE
>>>> t2.dochash=3Dt0=
>>>> .dochash AND t2.status IN
>>>> (&#39;A&#39;,&#39;F&#39;,&#39;a&#39;,&#39;f&#39;,=
>>>> &#39;D&#39;,&#39;d&#39;) AND t2.jobid!=3Dt0.jobid) AND NOT
EXISTS(SELECT
>>>> &#=
>>>> 39;x&#39; FROM prereqevents t3,events t4 WHERE <a
>>>> href=3D"http://t0.id">t0.=
>>>> id</a>=3Dt3.owner AND t3.eventname=3D<a
>>>> href=3D"http://t4.name">t4.name</a>=
>>>> ) ORDER BY t0.docpriority ASC LIMIT 4800;</div>
>>>>
>>>> <div class=3D"gmail_extra"><br></div><div class=3D"gmail_extra">#
Time:
>>>> 130=
>>>> 120 11:41:18<br></div><div class=3D"gmail_extra"># User@Host:
>>>> manifoldcf[ma=
>>>> nifoldcf] @ localhost [127.0.0.1]<br></div><div class=3D"gmail_extra">#
>>>> Que=
>>>> ry_time: 7.714277 =C2=A0Lock_time: 0.000123 Rows_sent: 0
>>>> =C2=A0Rows_examine=
>>>> d: 6365182</div>
>>>>
>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649678;</div><div
>>>> class=3D"g=
>>>> mail_extra">SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE
>>>> I=
>>>> NDEX (i1358228295210) WHERE status IN (&#39;P&#39;,&#39;G&#39;)
AND
>>>> checkac=
>>>> tion=3D&#39;R&#39; AND checktime&lt;=3D1358649661663 AND EXISTS(SELECT
>>>> &#39=
>>>> ;x&#39; FROM jobs t1 WHERE t1.status IN (&#39;A&#39;,&#39;a&#39;)
AND <a
>>>> hr=
>>>> ef=3D"http://t1.id">t1.id</a>=3Dt0.jobid) =C2=A0ORDER BY docpriority
ASC
>>>> LI=
>>>> MIT 1;</div>
>>>>
>>>> <div><br></div><div class=3D"gmail_extra" style>Regards,</div><div
>>>> class=3D=
>>>> "gmail_extra"><br></div><div class=3D"gmail_extra"><br></div><div
>>>> class=3D"=
>>>> gmail_extra" style>Shigeki</div><div class=3D"gmail_extra"><br></div><div
>>>> c=
>>>> lass=3D"gmail_extra">
>>>>
>>>> <br></div><br><div class=3D"gmail_quote">2013/1/18
Karl Wright <span
>>>> dir=3D=
>>>> "ltr">&lt;<a href=3D"mailto:daddywri@gmail.com"
>>>> target=3D"_blank">daddywri@=
>>>> gmail.com</a>&gt;</span><br><blockquote class=3D"gmail_quote"
>>>> style=3D"marg=
>>>> in:0px 0px 0px
>>>> 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,20=
>>>> 4);border-left-style:solid;padding-left:1ex">
>>>>
>>>> Hi Shigeki,<br>
>>>> <br>
>>>> What database is ManifoldCF configured to use in this case? =C2=A0Do
>>>> you<br=
>>>> >
>>>> see any indication of slow queries in the ManifoldCF log?<br>
>>>> <br>
>>>> <br>
>>>> Karl<br>
>>>> <br>
>>>> On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi<br>
>>>> &lt;<a
>>>> href=3D"mailto:shigeki.kobayashi3@g.softbank.co.jp">shigeki.kobayash=
>>>> i3@g.softbank.co.jp</a>&gt; wrote:<br>
>>>> &gt; Hello<br>
>>>> &gt;<br>
>>>> &gt;<br>
>>>> &gt; I would like some advice to improve crawling time of new/updated
>>>> files=
>>>>  using<br>
>>>> &gt; Windows share connection.<br>
>>>> &gt;<br>
>>>> &gt; I crawl file in Windows server and index them into Solr.<br>
>>>> &gt;<br>
>>>> &gt; Currently, the second crawling of two hundred thousands files takes
=
>>>> =C2=A0over 5<br>
>>>> &gt; hours, even though any files are not updated, created, deleted.<br>
>>>> &gt;<br>
>>>> &gt; I assume MCF does the following processes (let me know if I am
>>>> wrong)<=
>>>> br>
>>>> &gt;<br>
>>>> &gt; - obtain updated time of a file<br>
>>>> &gt; - compare the updated time with the one MCF obtained last time
>>>> crawlin=
>>>> g(<br>
>>>> &gt; probably stored in DB)<br>
>>>> &gt; - if they are different MCF recognizes the file is to be indexed.<br>
>>>> &gt;<br>
>>>> &gt; If the above processes are done for two thousands files, what part
of
>>>> =
>>>> the<br>
>>>> &gt; processes could take time the most? obtaining updated time? reading
>>>> da=
>>>> ta<br>
>>>> &gt; from DB? what could be done to increase the crawling time do you
>>>> think=
>>>> ?<br>
>>>> &gt;<br>
>>>> &gt; Please give me some advice.<br>
>>>> &gt;<br>
>>>> &gt;<br>
>>>> &gt; Regards,<br>
>>>> &gt;<br>
>>>> &gt; Shigeki<br>
>>>> &gt;<br>
>>>> &gt;<br>
>>>> </blockquote></div><br><br clear=3D"all"><div><br></div><br><div><font
>>>> face=
>>>> =3D"&#39;ms gothic&#39;, monospace">=C2=A0</font></div><div><br></div>
>>>> </div></div></div>
>>>>
>>>> --14dae9399bd9676e5704d3c356e9--
>>>
>>>
>>>
>>>
>>>

Mime
View raw message