manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Update on a now-fixed old problem and questions about database usage
Date Wed, 21 May 2014 20:38:00 GMT
Hi Tom,

What you are seeing is the result of hopcount logic, together with
ManifoldCF's periodic analysis and reindexing of sensitive tables.
Hopcount tracking in ManifoldCF is expensive, and if you don't actually
need it, I suggest you disable it (in your job, select "keep forever").
Periodically MCF finds a document which causes the hopcount of many already
crawled documents to shrink.  The effect of this is a great deal of
database activity.  And, of course, while this is going on, MCF may well
decide it's time to reindex, which slows things down even more.
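
To illustrate why one newly found link can touch so many rows, here is a toy
sketch in Python. It is not ManifoldCF's implementation; the graph, the
document names, and the hop_counts helper are all hypothetical. It only shows
that a single shortcut link shortens the hop count of every document
downstream of it.

```python
from collections import deque

def hop_counts(links, seed):
    """Breadth-first distances (hop counts) from the seed document."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        doc = queue.popleft()
        for child in links.get(doc, ()):
            if child not in dist:
                dist[child] = dist[doc] + 1
                queue.append(child)
    return dist

# Hypothetical link graph: a chain seed -> d1 -> d2 -> ... -> d6.
links = {"seed": ["d1"]}
for i in range(1, 6):
    links[f"d{i}"] = [f"d{i + 1}"]

before = hop_counts(links, "seed")

# A newly crawled page reveals a shortcut from the seed straight to d3.
links["seed"].append("d3")
after = hop_counts(links, "seed")

# Every document at or below d3 now has a smaller hop count.
changed = sorted(doc for doc in after if after[doc] < before[doc])
print(changed)
```

In this toy graph one new link changes the hop count of four of the six
crawled documents. In a real crawl each such change means database updates,
which is where the long-running queries come from.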

Karl



On Wed, May 21, 2014 at 4:31 PM, Tom Rees <trees@chiliad.com> wrote:

> Dear ManifoldCF:
>
> First, I would like to report that switching to ManifoldCF 1.6 solved a
> problem I encountered with version 1.4.1: whenever I ran two web crawls
> simultaneously the two crawls would stop progressing within half an hour.
> The 1.6 version works beautifully. Thank you for the excellent work.
>
> Now I have a couple of issues with the database that I would appreciate your
> feedback on. First, the two crawls that I mentioned finished and pulled
> down a little over 255,000 documents. The size of the postgres (version
> 9.3.2) database on the disk, however, expanded to use a little over 8 GB of
> space, and this is after running a full vacuum. This seems like a lot of
> space for two medium sized crawls. Is there a way to get the web crawler to
> use less database space?
>
> Secondly, when I ran two simultaneous web crawls with the NULL output
> connector, the crawls worked without issue. When I ran the same two
> simultaneous web crawls with a custom output connector that wrote the files
> to a local file system everything worked fine. However, when I used an
> output connector that wrote the downloaded files to a file system and put
> the path to each file on an ActiveMQ JMS queue, then the crawl showed
> quirky behavior. A few times the crawls stopped in their tracks and then
> after 40 - 60 minutes a message was printed to the logfile saying that the
> SQL queries took too long. The full dump of one set of these messages is
> below, at the end of this email. The web crawls always recover, and they
> are still running. I am using postgres 9.3.2 with manifoldcf, and so far it
> has not had many issues, except for the occasional SQL taking too long
> message, although these are infrequent. Do I need to use a different
> version of postgres? Or make some other change?
>
> Thank you for your help.
>
> Tom Rees
> Chiliad
>
> WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running
> query (2662579 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM
> intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND
>  t1.childidhash=t0.childidhash AND t1.isnew=?))]
>  WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 0: 'D'
>  WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 1: '-1'
>  WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 2:
> '1400623413113'
>  WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 3:
> 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'
>  WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 4: 'B'
>  WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Found a long-running
> query (2625296 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM
> intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND
> t1.childidhash=t0.childidhash AND t1.isnew=?))]
>  WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 0: 'D'
>  WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 1: '-1'
>  WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 2:
> '1400623413113'
>  WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 3:
> 'D942516DE5623A6417FCB994186B507E8CDA30D6'
>  WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 4: 'B'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running
> query (2675765 ms): [SELECT parentidhash FROM intrinsiclink WHERE jobid=?
> AND parentidhash IN (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
> AND linktype=? AND childidhash=? FOR UPDATE]
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 0:
> '1400623413113'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 1:
> '054FC31ACF6FB96D2F8D19FF9CC230349E6A7A76'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 2:
> '0774E538282FCA04F0FF95AC65D48EFC57CC6225'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 3:
> '1027C9AF07AE2B419C31A1D3B20352E31867BBBB'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 4:
> '1382DE9902A7CCC0012F043077E1739867CE00A4'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 5:
> '2E8844A26FCD3096DF0D6BC3BB3D6648FCBCA7FA'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 6:
> '34741F8B2706BCB202FDA72DABB94D916D497CD4'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 7:
> '6A5E47B467A29A8614B473856F1D28EC8B30F5F3'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 8:
> '71B865B0979B351279EFD9F99CA8AF700704400A'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 9:
> '77C6E57EBDD811027F776BF895E0B43275AF3628'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 10:
> '8267055C5CE6D7A1917F88B1FA310FC5082FD599'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 11:
> '8F361A3EDA0CAC989812623441DA02BD42883C4F'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 12:
> '956CCECF3FD5F508624E19270FD5EC28532B0922'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 13:
> '9BAA3731F101B3908E4FFF4A5325601C57B4CD57'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 14:
> 'AD628D16A2708EECD1C33AA0E63D849BCB5DF417'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 15:
> 'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 16:
> 'D1F182BF5B49CB4FBF274A1B63B54C2F684EC059'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 17:
> 'D7FB0CB3AFE34BC258686368296AF0D896C5786E'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 18:
> 'D807BE55355A53CA84B4163F42081A896B323A81'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 19:
> 'EDED88E796389DEB5E8DA14F1FD56088CDA8BF98'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 20:
> 'FE4A24472BD3648F839FFAB7B5476915504A9755'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 21: 'link'
>  WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 22:
> 'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
>  WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan: Update on
> hopcount  (cost=157.53..165.57 rows=1 width=81)
>  WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan: Update on
> hopcount  (cost=157.53..165.57 rows=1 width=81)
>  WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:   ->  Nested
> Loop  (cost=157.53..165.57 rows=1 width=81)
>  WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan:   ->  Nested
> Loop  (cost=157.53..165.57 rows=1 width=81)
>  WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:         ->
>  HashAggregate  (cost=157.11..157.12 rows=1 width=20)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
> ->  Hash Join  (cost=101.51..157.11 rows=1 width=20)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->
>  HashAggregate  (cost=157.11..157.12 rows=1 width=20)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
>       Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND
> ((t0.parentidhash)::text = (t1.parentidhash)::text))
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
> ->  Hash Join  (cost=101.51..157.11 rows=1 width=20)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
>     Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND
> ((t0.parentidhash)::text = (t1.parentidhash)::text))
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
>     ->  Index Scan using i1400371486543 on hopdeletedeps t0
>  (cost=0.56..55.95 rows=27 width=109)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
>       ->  Index Scan using i1400371486543 on hopdeletedeps t0
>  (cost=0.56..55.95 rows=27 width=109)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
>             Index Cond: ((jobid = 1400623413113::bigint) AND
> ((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text))
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
>           Index Cond: ((jobid = 1400623413113::bigint) AND
> ((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text))
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
>       ->  Hash  (cost=100.32..100.32 rows=42 width=101)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
>     ->  Hash  (cost=100.32..100.32 rows=42 width=101)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
>             ->  Index Scan using i1400371486547 on intrinsiclink t1
>  (cost=0.56..100.32 rows=42 width=101)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
>                   Index Cond: ((jobid = 1400623413113::bigint) AND
> ((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text)
> AND (isnew = 'B'::bpchar))
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
>           ->  Index Scan using i1400371486547 on intrinsiclink t1
>  (cost=0.56..100.32 rows=42 width=101)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:         ->
>  Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1
> width=69)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
>                 Index Cond: ((jobid = 1400623413113::bigint) AND
> ((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text)
> AND (isnew = 'B'::bpchar))
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->
>  Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1
> width=69)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
> Index Cond: (id = t0.ownerid)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '28') -
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
> Index Cond: (id = t0.ownerid)
>  WARN 2014-05-21 11:05:08,290 (Worker thread '4') -
>  WARN 2014-05-21 11:05:08,294 (Worker thread '40') -  Plan: LockRows
>  (cost=0.56..101.40 rows=3 width=47) (actual time=0.041..0.041 rows=0
> loops=1)
>
>
