manifoldcf-user mailing list archives

From LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <Daniel.Li...@developpement-durable.gouv.fr>
Subject Re: ManifoldCF + Postgresql - long freeze on job
Date Mon, 11 Feb 2019 09:16:27 GMT
Hello,

1/ The database we use is PostgreSQL version 9.6.

2/ I will check the logs to see what is happening with the queries.

3/ We run a VACUUM FULL ANALYZE every 24 hours. For each table, we set 
the reindex threshold to 5000000 (in properties.xml) with a line such as:
  <property name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink" value="5000000" />

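To double-check which reindex thresholds are actually set, a short script can list every property that shares the prefix shown above. This is only a minimal sketch: the function name `reindex_thresholds`, the sample document, and its `<configuration>` root element are assumptions made for illustration, not taken from a real properties.xml.

```python
import xml.etree.ElementTree as ET

# Prefix taken from the property line quoted in this message.
REINDEX_PREFIX = "org.apache.manifoldcf.db.postgres.reindex."


def reindex_thresholds(xml_text):
    """Return {table_name: threshold} for each reindex property found."""
    root = ET.fromstring(xml_text)
    thresholds = {}
    for prop in root.iter("property"):
        name = prop.get("name", "")
        if name.startswith(REINDEX_PREFIX):
            table = name[len(REINDEX_PREFIX):]
            thresholds[table] = int(prop.get("value"))
    return thresholds


# Hypothetical sample; the <configuration> root element is an assumption.
SAMPLE = """<configuration>
  <property name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink" value="5000000" />
</configuration>"""

if __name__ == "__main__":
    print(reindex_thresholds(SAMPLE))
```

On a real installation, the contents of properties.xml would be read from disk and passed in place of `SAMPLE`.
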
Is there a setting that disables the reindexing requested by 
ManifoldCF?

Thanks,

Daniel


On 08/02/2019 at 16:00, Karl Wright (via Internet, sent through 
user-return-5674-daniel.lirot=developpement-durable.gouv.fr@manifoldcf.apache.org) 
wrote:
> Hello,
>
> (1) What database are you using for this?  Some databases require 
> maintenance periodically or have other heavy usage constraints.
> (2) Every time a query takes more than a minute to execute, it is 
> logged, along with the query plan.  You need to look at the manifoldcf 
> log to see which queries are problematic before concluding anything.
> (3) For every database table, you can individually configure how many 
> table operations approximately occur before MCF re-analyzes the 
> table.  However, it's likely that you have the opposite problem: a bad 
> query plan for the query that queues documents for processing.  That 
> may call for more frequent analysis to prevent it.  But we cannot tell 
> until we understand which queries are taking a long time.
>
> Thanks,
> Karl
>
>
>
> On Fri, Feb 8, 2019 at 8:07 AM LIROT Daniel - SG/SPSSI/CPII/DOSO/ET 
> <Daniel.Lirot@developpement-durable.gouv.fr> wrote:
>
>     Hello,
>
>     We use ManifoldCF v2.10 with PostgreSQL (9.6) to crawl our websites.
>     This represents approximately 1.2 million documents.
>     We split the crawl into 4 jobs that distribute their results across 3
>     SOLR collections.
>     The crawl performs well up to about 500000 documents (25000 to 30000
>     docs / hour), then performance degrades sharply as it progresses. We
>     observe very long freezes, to the point that the crawl appears to
>     have stopped.
>     We suspect reindexing, notably of the intrinsiclink table, which is
>     very large (85 million rows).
>     Is it possible to disable the reindexing triggered by ManifoldCF?
>     Any other ideas?
>
>     best Regards
>     LIROT daniel
>     -- 
>

