Dear ManifoldCF:

First, I would like to report that switching to ManifoldCF 1.6 solved a problem I encountered with version 1.4.1: whenever I ran two web crawls simultaneously, both crawls would stop progressing within half an hour. The 1.6 version works beautifully. Thank you for the excellent work.

Now I have a couple of issues with the database that I would appreciate your feedback on. First, the two crawls that I mentioned finished and pulled down a little over 255,000 documents. The size of the Postgres (version 9.3.2) database on disk, however, grew to a little over 8 GB, and this is after running a full vacuum. This seems like a lot of space for two medium-sized crawls. Is there a way to get the web crawler to use less database space?

Secondly, when I ran two simultaneous web crawls with the NULL output connector, the crawls worked without issue. When I ran the same two simultaneous web crawls with a custom output connector that wrote the files to a local file system, everything worked fine. However, when I used an output connector that wrote the downloaded files to a file system and put the path to each file on an ActiveMQ JMS queue, the crawls showed quirky behavior. A few times the crawls stopped in their tracks, and then after 40 to 60 minutes a message was printed to the logfile saying that the SQL queries took too long. The full dump of one set of these messages is below, at the end of this email. The web crawls always recover, and they are still running. I am using Postgres 9.3.2 with ManifoldCF, and so far it has not had many issues, apart from the occasional "SQL taking too long" message, which is infrequent. Do I need to use a different version of Postgres? Or make some other change?
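
For reference, here is a minimal Python sketch of how the durations in those warnings line up with the stalls I observed. It assumes the log line format shown in the dump at the end of this email; the function name and the abbreviated sample lines are only illustrative and are not part of ManifoldCF.

```python
import re

# Matches warnings of the form:
#   "... (Worker thread '28') - Found a long-running query (2662579 ms): [..."
LONG_QUERY = re.compile(
    r"\(Worker thread '(\d+)'\) - Found a long-running query \((\d+) ms\)"
)

def long_query_minutes(log_text):
    """Return a list of (worker thread id, duration in minutes), one per warning."""
    results = []
    for match in LONG_QUERY.finditer(log_text):
        thread_id = match.group(1)
        minutes = int(match.group(2)) / 60000.0  # milliseconds -> minutes
        results.append((thread_id, round(minutes, 1)))
    return results

# Sample lines abbreviated from the dump; the SQL text is cut short here.
sample = """\
WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running query (2662579 ms): [UPDATE hopcount
WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Found a long-running query (2625296 ms): [UPDATE hopcount
WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running query (2675765 ms): [SELECT parentidhash
"""

for thread_id, minutes in long_query_minutes(sample):
    print("thread %s: %.1f minutes" % (thread_id, minutes))
```

All three queries in the dump ran for roughly 44 minutes, which matches the length of the stalls.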

Thank you for your help.

Tom Rees
Chiliad

WARN 2014-05-21 11:05:08,230 (Worker thread '28') - Found a long-running query (2662579 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 0: 'D'
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 1: '-1'
 WARN 2014-05-21 11:05:08,230 (Worker thread '28') -   Parameter 2: '1400623413113'
 WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 3: 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'
 WARN 2014-05-21 11:05:08,231 (Worker thread '28') -   Parameter 4: 'B'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') - Found a long-running query (2625296 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 0: 'D'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 1: '-1'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 2: '1400623413113'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 3: 'D942516DE5623A6417FCB994186B507E8CDA30D6'
 WARN 2014-05-21 11:05:08,243 (Worker thread '4') -   Parameter 4: 'B'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') - Found a long-running query (2675765 ms): [SELECT parentidhash FROM intrinsiclink WHERE jobid=? AND parentidhash IN (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) AND linktype=? AND childidhash=? FOR UPDATE]
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 0: '1400623413113'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 1: '054FC31ACF6FB96D2F8D19FF9CC230349E6A7A76'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 2: '0774E538282FCA04F0FF95AC65D48EFC57CC6225'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 3: '1027C9AF07AE2B419C31A1D3B20352E31867BBBB'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 4: '1382DE9902A7CCC0012F043077E1739867CE00A4'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 5: '2E8844A26FCD3096DF0D6BC3BB3D6648FCBCA7FA'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 6: '34741F8B2706BCB202FDA72DABB94D916D497CD4'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 7: '6A5E47B467A29A8614B473856F1D28EC8B30F5F3'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 8: '71B865B0979B351279EFD9F99CA8AF700704400A'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 9: '77C6E57EBDD811027F776BF895E0B43275AF3628'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 10: '8267055C5CE6D7A1917F88B1FA310FC5082FD599'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 11: '8F361A3EDA0CAC989812623441DA02BD42883C4F'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 12: '956CCECF3FD5F508624E19270FD5EC28532B0922'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 13: '9BAA3731F101B3908E4FFF4A5325601C57B4CD57'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 14: 'AD628D16A2708EECD1C33AA0E63D849BCB5DF417'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 15: 'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 16: 'D1F182BF5B49CB4FBF274A1B63B54C2F684EC059'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 17: 'D7FB0CB3AFE34BC258686368296AF0D896C5786E'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 18: 'D807BE55355A53CA84B4163F42081A896B323A81'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 19: 'EDED88E796389DEB5E8DA14F1FD56088CDA8BF98'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 20: 'FE4A24472BD3648F839FFAB7B5476915504A9755'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 21: 'link'
 WARN 2014-05-21 11:05:08,252 (Worker thread '40') -   Parameter 22: 'B661E6DD08FD89A6643A706ECAB6E1729FC623C8'
 WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan: Update on hopcount  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan: Update on hopcount  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:   ->  Nested Loop  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '4') -  Plan:   ->  Nested Loop  (cost=157.53..165.57 rows=1 width=81)
 WARN 2014-05-21 11:05:08,289 (Worker thread '28') -  Plan:         ->  HashAggregate  (cost=157.11..157.12 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:               ->  Hash Join  (cost=101.51..157.11 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->  HashAggregate  (cost=157.11..157.12 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:                     Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND ((t0.parentidhash)::text = (t1.parentidhash)::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:               ->  Hash Join  (cost=101.51..157.11 rows=1 width=20)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:                     Hash Cond: (((t0.linktype)::text = (t1.linktype)::text) AND ((t0.parentidhash)::text = (t1.parentidhash)::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:                     ->  Index Scan using i1400371486543 on hopdeletedeps t0  (cost=0.56..55.95 rows=27 width=109)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:                     ->  Index Scan using i1400371486543 on hopdeletedeps t0  (cost=0.56..55.95 rows=27 width=109)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:                           Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:                           Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text))
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:                     ->  Hash  (cost=100.32..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:                     ->  Hash  (cost=100.32..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:                           ->  Index Scan using i1400371486547 on intrinsiclink t1  (cost=0.56..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:                                 Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'A2EB225081B47722CCAEB3293A28EEB2F264E02C'::text) AND (isnew = 'B'::bpchar))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:                           ->  Index Scan using i1400371486547 on intrinsiclink t1  (cost=0.56..100.32 rows=42 width=101)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:         ->  Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1 width=69)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:                                 Index Cond: ((jobid = 1400623413113::bigint) AND ((childidhash)::text = 'D942516DE5623A6417FCB994186B507E8CDA30D6'::text) AND (isnew = 'B'::bpchar))
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:         ->  Index Scan using hopcount_pkey on hopcount  (cost=0.42..8.45 rows=1 width=69)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') -  Plan:               Index Cond: (id = t0.ownerid)
 WARN 2014-05-21 11:05:08,290 (Worker thread '28') - 
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') -  Plan:               Index Cond: (id = t0.ownerid)
 WARN 2014-05-21 11:05:08,290 (Worker thread '4') - 
 WARN 2014-05-21 11:05:08,294 (Worker thread '40') -  Plan: LockRows  (cost=0.56..101.40 rows=3 width=47) (actual time=0.041..0.041 rows=0 loops=1)