manifoldcf-user mailing list archives

From Tom Rees <tr...@chiliad.com>
Subject Update on a now-fixed old problem and questions about database usage
Date Wed, 21 May 2014 20:31:57 GMT
Dear ManifoldCF:

First, I would like to report that switching to ManifoldCF 1.6 solved a
problem I encountered with version 1.4.1: whenever I ran two web crawls
simultaneously, both crawls would stop progressing within half an hour.
The 1.6 version works beautifully. Thank you for the excellent work.

Now I have a couple of issues with the database that I would appreciate
your feedback on. First, the two crawls that I mentioned finished and
pulled down a little over 255,000 documents. The size of the PostgreSQL
(version 9.3.2) database on disk, however, grew to a little over 8 GB,
and that is after running a full vacuum. This seems like a lot of space
for two medium-sized crawls. Is there a way to get the web crawler to use
less database space?
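
For a rough sense of scale, the two figures above work out to about 33 KB
of database space per crawled document. The following is only a minimal
sketch of that arithmetic. It assumes "8 GB" means 8 GiB and uses the
rounded figures quoted above, so the result is approximate; the names are
illustrative and not part of ManifoldCF.

```python
# Rough per-document database footprint from the figures in this message.
# Assumptions: "8 GB" is treated as 8 GiB, and both figures are the rounded
# values quoted above ("a little over" in each case).
DB_SIZE_BYTES = 8 * 1024**3
DOCUMENT_COUNT = 255_000


def bytes_per_document(db_size_bytes: int, document_count: int) -> float:
    """Average on-disk database bytes attributable to each crawled document."""
    return db_size_bytes / document_count


if __name__ == "__main__":
    per_doc = bytes_per_document(DB_SIZE_BYTES, DOCUMENT_COUNT)
    print(f"{per_doc / 1024:.1f} KiB per document")
```

This average covers everything stored in the database, including indexes,
so it says nothing about which tables account for the space.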

Secondly, when I ran two simultaneous web crawls with the NULL output
connector, the crawls worked without issue. When I ran the same two
simultaneous web crawls with a custom output connector that wrote the
files to a local file system, everything also worked fine. However, when
I used an output connector that wrote the downloaded files to a file
system and put the path to each file on an ActiveMQ JMS queue, the crawl
behaved erratically. A few times the crawls stopped in their tracks, and
then after 40 to 60 minutes a message was printed to the log file saying
that the SQL queries took too long. The full dump of one set of these
messages is below, at the end of this email. The web crawls always
recover, and they are still running. I am using PostgreSQL 9.3.2 with
ManifoldCF, and so far it has not had many issues apart from these
infrequent long-running query warnings. Do I need to use a different
version of PostgreSQL? Or make some other change?

Thank you for your help.

Tom Rees
Chiliad

WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running
query (2662579 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
t1.isnew=?))]
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 0: 'D'
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 1: '-1'
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 2:
'1400623413113'
 WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 3:
'A2EB225081B47722CCAEB3293A28EEB2F264E02C'
 WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 4: 'B'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Found a long-running
query (2625296 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
t1.isnew=?))]
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 0: 'D'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 1: '-1'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 2:
'1400623413113'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 3:
'D942516DE5623A6417FCB994186B507E8CDA30D6'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 4: 'B'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running
query (2675765 ms): [SELECT parentidhash FROM intrinsiclink WHERE jobid=?
AND parentidhash IN (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) AND
linktype=? AND childidhash=? FOR UPDATE]
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 0:
'1400623413113'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 1:
'054FC31ACF6FB96D2F8D19FF9CC230349E6A7A76'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 2:
'0774E538282FCA04F0FF95AC65D48EFC57CC6225'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 3:
'1027C9AF07AE2B419C31A1D3B20352E31867BBBB'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 4:
'1382DE9902A7CCC0012F043077E1739867CE00A4'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 5:
'2E8844A26FCD3096DF0D6BC3BB3D6648FCBCA7FA'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 6:
'34741F8B2706BCB202FDA72DABB94D916D497CD4'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 7:
'6A5E47B467A29A8614B473856F1D28EC8B30F5F3'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 8:
'71B865B0979B351279EFD9F99CA8AF700704400A'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 9:
'77C6E57EBDD811027F776BF895E0B43275AF3628'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 10:
'8267055C5CE6D7A1917F88B1FA310FC5082FD599'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 11:
'8F361A3EDA0CAC989812623441DA02BD42883C4F'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 12:
'956CCECF3FD5F508624E19270FD5EC28532B0922'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 13:
'9BAA3731F101B3908E4FFF4A5325601C57B4CD57'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 14:
'AD628D16A2708EECD1C33AA0E63D849BCB5DF417'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 15:
'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 16:
'D1F182BF5B49CB4FBF274A1B63B54C2F684EC059'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 17:
'D7FB0CB3AFE34BC258686368296AF0D896C5786E'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 18:
'D807BE55355A53CA84B4163F42081A896B323A81'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 19:
'EDED88E796389DEB5E8DA14F1FD56088CDA8BF98'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 20:
'FE4A24472BD3648F839FFAB7B5476915504A9755'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 21: 'link'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 22:
'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
 WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan: Update on
hopcount  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan: Update on
hopcount  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:   ->  Nested
Loop  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan:   ->  Nested
Loop  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:         ->
 HashAggregate  (cost=157.11..157.12 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
->  Hash Join  (cost=101.51..157.11 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->
 HashAggregate  (cost=157.11..157.12 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
    Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND
((t0.parentidhash)::text = (t1.parentidhash)::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:               ->
 Hash Join  (cost=101.51..157.11 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
    Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND
((t0.parentidhash)::text = (t1.parentidhash)::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
    ->  Index Scan using i1400371486543 on hopdeletedeps t0
 (cost=0.56..55.95 rows=27 width=109)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
    ->  Index Scan using i1400371486543 on hopdeletedeps t0
 (cost=0.56..55.95 rows=27 width=109)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
          Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
          Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
    ->  Hash  (cost=100.32..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
    ->  Hash  (cost=100.32..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
          ->  Index Scan using i1400371486547 on intrinsiclink t1
 (cost=0.56..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
                Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text)
AND (isnew = 'B'::bpchar))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
          ->  Index Scan using i1400371486547 on intrinsiclink t1
 (cost=0.56..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:         ->
 Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1
width=69)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
                Index Cond: ((jobid = 1400623413113::bigint) AND
((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text)
AND (isnew = 'B'::bpchar))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->
 Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1
width=69)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:
Index Cond: (id = t0.ownerid)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:
Index Cond: (id = t0.ownerid)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -
 WARN 2014-05-21 11:05:08,294 (Worker thread '40') -  Plan: LockRows
 (cost=0.56..101.40 rows=3 width=47) (actual time=0.041..0.041 rows=0
loops=1)
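
The durations in the warnings above are reported in milliseconds, which
makes them hard to compare with the 40 to 60 minute stalls described in
the message. The following is a minimal Python sketch, not a ManifoldCF
tool, that pulls the worker thread and duration out of each "Found a
long-running query" warning and converts the duration to minutes. The
regular expression and the function name are assumptions based only on
the log lines shown here, including their mail-client line wrapping.

```python
import re

# Matches the warning header as it appears in this dump. The "\s+" allows
# for the line break that the mail client inserted before "query".
LONG_QUERY = re.compile(
    r"\(Worker thread '(?P<thread>\d+)'\) - Found a long-running\s+"
    r"query \((?P<ms>\d+) ms\)"
)


def long_running_queries(log_text: str) -> list[tuple[str, float]]:
    """Return (worker thread, duration in minutes) for each warning found."""
    return [
        (m.group("thread"), int(m.group("ms")) / 60_000)
        for m in LONG_QUERY.finditer(log_text)
    ]


# Two warning headers copied from the dump above.
SAMPLE = """\
WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running
query (2662579 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running
query (2675765 ms): [SELECT parentidhash FROM intrinsiclink WHERE jobid=?
"""

if __name__ == "__main__":
    for thread, minutes in long_running_queries(SAMPLE):
        print(f"worker {thread}: {minutes:.1f} min")
```

Applied to the three warnings in this dump, the durations come out at
roughly 44 to 45 minutes each, which is consistent with the length of the
stalls reported above.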
