manifoldcf-user mailing list archives

From SAUNIER Maxence
Subject Problems and evolutions (CITYA)
Date Fri, 22 Dec 2017 15:07:51 GMT

I am Maxence SAUNIER and I work at CITYA Immobilier in France. We currently use ManifoldCF to
crawl several tens of millions of documents, and we have run into several problems with it.

ManifoldCF API for scripts:

  *   In the web service /json/jobstatuses
     *   the job 'name' field is named 'description'. Is this a mistake?
     *   Would it be possible to have a 'job_name' field in addition to the 'machine' field?
        *   Today, I have to request the URL /json/jobs and save the result in local files
for each of my servers, in order to map job IDs to the names displayed to the user of the
script. This request takes a lot of time that could be avoided.
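
The workaround described above can be sketched as follows. This is only a minimal illustration, not ManifoldCF code: the function name attach_job_names is hypothetical, and the payload shapes (a "job" list with "id" and "description", and a "jobstatus" list with "job_id") are assumptions about what /json/jobs and /json/jobstatuses return and should be checked against the real responses.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """Return value as a list; a lone object is wrapped, None becomes []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def attach_job_names(jobs_payload: dict, statuses_payload: dict) -> list:
    """Add a 'job_name' key to each job status record.

    Assumed (not verified) payload shapes:
      jobs_payload     = {"job": [{"id": "...", "description": "..."}]}
      statuses_payload = {"jobstatus": [{"job_id": "...", "status": "..."}]}
    """
    # Build the id -> name lookup once, so it can be cached between calls.
    names = {
        str(job["id"]): job.get("description", "")
        for job in _as_list(jobs_payload.get("job"))
    }
    merged = []
    for status in _as_list(statuses_payload.get("jobstatus")):
        row = dict(status)  # copy, so the input is left untouched
        row["job_name"] = names.get(str(row.get("job_id")), "")
        merged.append(row)
    return merged
```

The lookup table only needs to be rebuilt when jobs are created or renamed, which is why the slow /json/jobs request currently has to be cached in local files.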


  *   After a certain amount of time, we constantly have IOWait problems on the virtual
machine. Here are its specifications and some details.
  *   Specifications of the virtual machine:
     *   15K disk, 140 GB
     *   12 GB RAM
     *   4 vCPU
     *   RAM allocated to PostgreSQL: 7 GB
     *   RAM allocated to ManifoldCF: 4 GB
     *   System: Debian, using 130 MB of RAM
  *   I investigated, and the I/O appears to be caused by PostgreSQL and its ANALYZE queries.
Screenshots are attached to this email.
  *   Why are there EXPLAIN queries?

"postgres";"postgres";"";"2017-11-15 11:05:42.855741+01";"active";"SELECT datname,
usename, client_addr, query_start, state, REGEXP_REPLACE(query, E' *[\\n\\r]+ *', ' ', 'g')
AS query FROM pg_stat_activity ORDER BY query_start DESC;"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:42.036039+01";"idle";"SELECT * FROM
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.442853+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.41319+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.410481+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.308415+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.308415+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.301668+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.301102+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.300208+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.288904+01";"idle";"COMMIT"
"manifoldbdd";"manifoldcf";"";"2017-11-15 11:05:36.24823+01";"active";"ANALYZE jobqueue"
"postgres";"postgres";"";"2017-11-15 11:00:42.07803+01";"idle";"SELECT 1 FROM
pg_available_extensions WHERE name='adminpack'"
"postgres";"postgres";"";"2017-11-15 10:59:24.869264+01";"idle";"SELECT version();"
"manifoldbdd";"postgres";"";"2017-11-15 10:59:19.29933+01";"idle";"SELECT rolname
FROM pg_roles WHERE rolcanlogin ORDER BY 1"
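
An export like the one above can be tallied by state and leading SQL keyword to see what PostgreSQL is busy with. The following is a rough sketch under the assumption that the export is semicolon-delimited with double-quoted fields in the column order of the query shown (datname, usename, client_addr, query_start, state, query); the function name summarize_activity is hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_activity(export_text: str) -> Counter:
    """Count rows of a pg_stat_activity export by (state, first SQL keyword)."""
    reader = csv.reader(
        io.StringIO(export_text, newline=""), delimiter=";", quotechar='"'
    )
    counts = Counter()
    for row in reader:
        if len(row) != 6:
            continue  # skip blank, truncated or otherwise malformed rows
        state, query = row[4], row[5]
        words = query.split()
        keyword = words[0].upper() if words else ""
        counts[(state, keyword)] += 1
    return counts
```

Calling summarize_activity(text).most_common() lists the most frequent (state, keyword) pairs first.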


Local Tika content text:

  *   We need to store the 'content_text' of the indexed files in Solr. Despite creating the
fields 'content_fr', 'content_en', 'content', 'text' or 'content_text' and adding them to
Solr, the content is either not sent by ManifoldCF or not stored by Solr. In ManifoldCF, the
local Tika has been set to send all metadata, and I don't know whether the problem comes from
Tika, ManifoldCF or Solr. Is some configuration missing? Do you have a procedure for adding
this content_text field regardless of the language? (All our documents are in

Thanks for your help.


