manifoldcf-user mailing list archives

From LIROT Daniel - SG/SPSSI/CPII/DOSO/ET <Daniel.Li...@developpement-durable.gouv.fr>
Subject Re: ManifoldCF + Postgresql - long freeze on job
Date Mon, 11 Feb 2019 12:35:56 GMT
Hi,

We looked at the table "Advanced properties.xml properties" and use it 
to set:
<property 
name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink" 
value="5000000" />
for the intrinsiclink table, and we do the same for the other tables.
However, is there a value that disables the reindex and the analyze 
entirely, for example "-1" or "0"? I didn't find one in the documentation.
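
For reference, a minimal sketch of what such a properties.xml fragment might look like. The reindex property and its value are the ones quoted above. The analyze property name is an assumption based on the same naming pattern and should be checked against the "Advanced properties.xml properties" table, and its value is purely illustrative. Neither value is known to disable the operation; a larger threshold only makes it less frequent.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
  <!-- From this thread: number of changes to intrinsiclink before a reindex -->
  <property name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink" value="5000000" />
  <!-- Assumed name (same pattern, per-table analyze threshold); value is illustrative only -->
  <property name="org.apache.manifoldcf.db.postgres.analyze.intrinsiclink" value="5000000" />
</configuration>
```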

Thank you


On 11/02/2019 at 12:26, Karl Wright (via Internet, delivered by 
user-return-5690-daniel.lirot=developpement-durable.gouv.fr@manifoldcf.apache.org) 
wrote:
> See: 
> https://manifoldcf.apache.org/release/release-1.10/en_US/how-to-build-and-deploy.html#file+properties
>
> Look at the table "Advanced properties.xml properties"
>
> Karl
>
>
> On Mon, Feb 11, 2019 at 4:16 AM LIROT Daniel - SG/SPSSI/CPII/DOSO/ET 
> <Daniel.Lirot@developpement-durable.gouv.fr 
> <mailto:Daniel.Lirot@developpement-durable.gouv.fr>> wrote:
>
>     Hello,
>
>     1/ The database we use is Postgresql version 9.6
>
>     2/ I will check the logs to see what is happening with the queries.
>
>     3/ We run a VACUUM FULL ANALYZE every 24 hours. For each table, we
>     set the reindex threshold to 5000000 (in properties.xml) with a
>     line such as:
>      <property
>     name="org.apache.manifoldcf.db.postgres.reindex.intrinsiclink"
>     value="5000000" />
>
>     Is there a setting that disables the reindex requested by
>     ManifoldCF?
>
>     thanks
>
>     Daniel
>
>
>     On 08/02/2019 at 16:00, Karl Wright (via Internet, delivered by
>     user-return-5674-daniel.lirot=developpement-durable.gouv.fr@manifoldcf.apache.org
>     <mailto:user-return-5674-daniel.lirot=developpement-durable.gouv.fr@manifoldcf.apache.org>)
>     wrote:
>>     Hello,
>>
>>     (1) What database are you using for this?  Some databases require
>>     maintenance periodically or have other heavy usage constraints.
>>     (2) Every time a query takes more than a minute to execute, it
>>     is logged, along with the query plan.  You need to look at the
>>     manifoldcf log to see which queries are problematic before
>>     concluding anything.
>>     (3) For every database table, you can individually configure how
>>     many table operations approximately occur before MCF re-analyzes
>>     the table.  However, it's likely that you have the opposite
>>     problem: a bad query plan for the query that queues documents for
>>     processing.  Preventing that may require more frequent analysis.
>>     But we cannot tell until we understand which queries are
>>     taking a long time.
>>
>>     Thanks,
>>     Karl
>>
>>
>>
>>     On Fri, Feb 8, 2019 at 8:07 AM LIROT Daniel -
>>     SG/SPSSI/CPII/DOSO/ET <Daniel.Lirot@developpement-durable.gouv.fr
>>     <mailto:Daniel.Lirot@developpement-durable.gouv.fr>> wrote:
>>
>>         Hello,
>>
>>         We use ManifoldCF v2.10 with PostgreSQL (9.6) to crawl our
>>         websites.
>>         This represents approximately 1.2 million documents.
>>         We split the crawl into 4 jobs that distribute their results
>>         across 3 Solr collections.
>>         The crawl performs well up to 500,000 documents (25,000 to
>>         30,000 docs/hour), then performance drops sharply as it
>>         progresses. We observe very long freezes, to the point that
>>         the crawl appears to have stopped.
>>         We suspect a reindex, notably of the intrinsiclink table,
>>         which is very large (85 million rows).
>>         Is it possible to disable the reindexing triggered by ManifoldCF?
>>         Any other ideas?
>>
>>         best Regards
>>         LIROT daniel
>>         -- 
>>
>