manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
Date Wed, 08 Mar 2017 11:52:07 GMT
Hi Guy,

The agents thread dump shows that there's a lock stuck from somewhere; I
expect it's from the UI.  Next time this happens, could you get a thread
dump for the UI process as well as from the agents process?  Thanks!!

Karl


On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Guy,
>
> See https://issues.apache.org/jira/browse/CONNECTORS-590.
>
> When you see "unexpected" this is not a good sign:
>
> >>>>>>
>
> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
> Unexpected jobqueue status - record id 1488898668325, expecting active
> status, saw 4
>
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
> jobqueue status - record id 1488898668325, expecting active status, saw 4
>
>                 at org.apache.manifoldcf.crawler.
> jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
>
>                 at org.apache.manifoldcf.crawler.
> jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
>
>                 at org.apache.manifoldcf.crawler.
> system.WorkerThread.run(WorkerThread.java:710)
>
> <<<<<<
>
>
> I've spent weeks chasing this issue in the past.  I have not seen it at
> all except in certain installations: 2 of them, counting yours.  I
> introduced transaction forensics into the MCF code to figure out what was
> going on and I definitively proved (on Postgresql 9.1) that we were seeing
> a transactional integrity problem with Postgresql itself.  A ticket against
> Postgresql was not logged because they'd need a reproducible test case and
> also the latest version, and it didn't happen on 9.3 at the time.
>
> You also seem to be seeing a deadlock in MCF locks as well.  When the
> Postgresql bug happens it means that the database is in a state that MCF
> really can't figure out and thus doesn't know how to deal with, so this may
> just be a downstream result of that occurrence.  But I can't be sure
> without further analysis.
>
> I'm very curious now about the details of your setup.  (1) How is your
> postgresql set up?  What version did you decide to use?  (2) How many
> agents processes do you have?  Just one?  (3) What OS? (4) What JDK?
>
> Other things you can try:
> (1) Running the postgresql LT tests (ant run-LT-postgresql) against your
> postgresql installation; you will need to change the test code itself to
> allow it to create an instance for testing in that case;
> (2) Turning on database transaction forensics (property name
> "org.apache.manifoldcf.diagnostics" value "DEBUG")
> (3) Turning on lock debugging (property name "org.apache.manifoldcf.lock"
> value "DEBUG")
>
> Transaction forensics uses memory but only generates lots of output when
> something goes wrong, so that may be the best place to start.  Turning on
> lock debugging will generate a large amount of log output but may be
> necessary, especially if we want to figure out if the lock issue is a
> downstream problem caused by the database integrity issue.
>
>
> Thanks,
> Karl
>
>
> On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy <Guy.Standen@uk.fujitsu.com>
> wrote:
>
>> Hi Karl,
>>
>>                 I have upgraded to MCF 2.6 on Windows using Zookeeper for
>> synchronisation and PostgreSQL 9.3.16 for the database.
>>
>> I re-ran  the pair of jobs ( 1 web crawl of local intranet site every 15
>> minutes  and  1  crawl of  JDBC connected database every 3 minutes) that
>> would  previously lock up after a couple of iterations.  These seemed to
>> run much better together but after 6 hours running there was a lock up for
>> both these jobs (both indicating they were starting up).
>>
>>
>>
>> Attempting to restart or abort the jobs through the MCF webapp failed.
>>
>>
>>
>> After restarting the MCF agents process all sprang into life and the jobs
>> aborted/restarted as expected.
>>
>>
>>
>> In the  MCF log the following error was found
>>
>> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
>> Unexpected jobqueue status - record id 1488898668325, expecting active
>> status, saw 4
>>
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
>> jobqueue status - record id 1488898668325, expecting active status, saw 4
>>
>>                 at org.apache.manifoldcf.crawler.
>> jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
>>
>>                 at org.apache.manifoldcf.crawler.
>> jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
>>
>>                 at org.apache.manifoldcf.crawler.
>> system.WorkerThread.run(WorkerThread.java:710)
>>
>> WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption
>> reported for job 1488898090224 connection 'web': Job no longer active
>>
>> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't
>> release lock we don't hold
>>
>> java.lang.IllegalStateException: Can't release lock we don't hold”
>>
>>
>>
>>
>>
>> ManifoldCF does not seem to have recovered from this error and  the jobs
>> can’t be re-started without restarting ManifoldCF agents.
>>
>>
>>
>> The manifoldcf, PostgreSQL and Agents JVM stack trace are attached
>>
>>
>>
>> Best Regards,
>>
>>
>>
>> Guy
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* 06 March 2017 09:24
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>>
>>
>>
>> Hi Guy,
>>
>>
>>
>> (1) I have no experience with PostgresSQL versions beyond 9.3, but I
>> doubt you would have problems.
>>
>> (2) If you are using multiple processes, even if there's only one agents
>> process, you must use synchronization.  I would recommend Zookeeper;
>> file-system-based synchronization is deprecated.
>>
>> (3) Windows has many ways of interfering with file-based sync, including
>> path-length issues.  I have seen Windows fail to unlock files and need a
>> reboot to release the lock.  This is one reason why file-system-based
>> locking is deprecated.
>>
>>
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>>
>> On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy <Guy.Standen@uk.fujitsu.com>
>> wrote:
>>
>> Hi Karl,
>>
>>                 Thanks for that I will try version 2.6.  Whilst moving to
>> MCF 2.6 I would potentially like to upgrade my backend PostgreSQL version
>> from 9.3.5.
>>
>> 1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6
>>   e.g. PostgreSQL 9.3.16 or PostgreSQL 9.6.2?
>>
>> 2) For a production system on a single server running a single MCF agents
>> process would you recommend the file based synchronisation locking or
>> zookeeper based synchronisation locking.  With the file based
>> synchronisation locking mechanism I have sometimes seen errors of the form :
>>
>>                 'D:\Apps\ManifoldCF\apache-man
>> ifoldcf-2.0.1\multiprocess-file-example\.\.\syncharea\475\
>> 708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr COLL1 osp_unstruct.lock'
>> failed: Access is denied’ ( I have ensured that the SYNCHAREA  is not
>> scanned by AV or Indexed by Windows Search  and all MCF processes run as
>> the same user)
>>
>> What could cause these errors?
>>
>>
>>
>> Many Thanks,
>>
>>
>>
>> Guy
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* 03 March 2017 17:35
>>
>>
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>>
>>
>>
>> Hi Guy:
>>
>>
>>
>> It is expected that sometimes database deadlock will develop, and the
>> transaction will need to be retried.  There is code in MCF for doing this:
>>
>>
>>
>> >>>>>>
>>
>>     if (sqlState != null && sqlState.equals("40001"))
>>
>>       return new ManifoldCFException(message,e,
>> ManifoldCFException.DATABASE_TRANSACTION_ABORT);
>>
>> <<<<<<
>>
>>
>>
>> I suspect that your version of MCF is old enough so that this particular
>> error and the associated retry are not taking place.  Upgrading to 2.6 will
>> definitely help there.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy <Guy.Standen@uk.fujitsu.com>
>> wrote:
>>
>> Hi Karl,
>>
>>                 Thanks for coming back so quickly. Unfortunately I wasn’t
>> using a JCIFS connection. One of the issues I was seeing was between a
>> crawl of an intranet site (no explicit throttling other than number of
>> connections) and   scheduled crawl (every 5 mins) to a relational DB via
>> JDBC connector again no explicit throttling.  To simplify things both jobs
>> are using a NULL output connection. Sometimes both the Web crawl and the
>> JDBC connection can run together but at other times 1 or both jobs will
>> appear to lock up with just a few active documents showing. When I get a
>> lock up the mcf log contains errors like:
>>
>>
>>
>> “DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database
>> exception: SQLException doing query (40001): ERROR: could not serialize
>> access due to read/write dependencies among transactions”
>>
>>
>>
>> See the attached log extract for a little more detail.
>>
>>
>>
>> Any view why this might be happening?
>>
>>
>>
>> Best Regards,
>>
>>
>>
>> Guy
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* 03 March 2017 11:27
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>>
>>
>>
>> Hi Guy,
>>
>>
>>
>> A issue with concurrent jobs is known for jobs sharing the same JCIFS
>> connection.  Is that what you are using?  This has nothing to do with the
>> version of Postgresql you are using; it has to do with what "bins"
>> documents are thought to come from.  There has been a recent improvement
>> for this issue, which will be released in April.  See
>> https://issues.apache.org/jira/browse/CONNECTORS-1364.
>>
>>
>>
>> The current version of MCF (2.6) supports Solr 6.x.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>> On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy <Guy.Standen@uk.fujitsu.com>
>> wrote:
>>
>> Hi Karl,
>>
>> I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows and have
>> had some issues with multiple jobs running concurrently.
>>
>> I am considering upgrading to MCF 2.6 and to a newer version of
>> PostgreSQL. Would you be able to advise which version of PostgreSQL I
>> should consider using with MCF 2.6 (e.g.  PostgreSQL  9.3.16 or 9.6.2)
>>
>>
>>
>> I am also considering upgrading from SOLR 4.10.3 to a newer version. The
>> MCF compatibility matrix mentions that compatibility has been tested to
>> SOLR version 4.5.1. Do you have any advice about compatibility with the
>> newer versions of SOLR e.g. 6.4.1.
>>
>>
>>
>> Best Regards
>>
>>
>>
>> Guy
>>
>>
>>
>>
>> Unless otherwise stated, this email has been sent from Fujitsu Services
>> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
>> England No 2216100) both with registered offices at: 22 Baker Street,
>> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
>> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
>> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
>> Middlesex, UB4 8FE.
>> This email is only for the use of its intended recipient. Its contents
>> are subject to a duty of confidence and may be privileged. Fujitsu does not
>> guarantee that this email has not been intercepted and amended or that it
>> is virus-free.
>>
>>
>>
>>
>> Unless otherwise stated, this email has been sent from Fujitsu Services
>> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
>> England No 2216100) both with registered offices at: 22 Baker Street,
>> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
>> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
>> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
>> Middlesex, UB4 8FE.
>> This email is only for the use of its intended recipient. Its contents
>> are subject to a duty of confidence and may be privileged. Fujitsu does not
>> guarantee that this email has not been intercepted and amended or that it
>> is virus-free.
>>
>>
>>
>>
>> Unless otherwise stated, this email has been sent from Fujitsu Services
>> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
>> England No 2216100) both with registered offices at: 22 Baker Street,
>> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
>> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
>> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
>> Middlesex, UB4 8FE.
>> This email is only for the use of its intended recipient. Its contents
>> are subject to a duty of confidence and may be privileged. Fujitsu does not
>> guarantee that this email has not been intercepted and amended or that it
>> is virus-free.
>>
>>
>>
>> Unless otherwise stated, this email has been sent from Fujitsu Services
>> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
>> England No 2216100) both with registered offices at: 22 Baker Street,
>> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
>> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
>> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
>> Middlesex, UB4 8FE.
>> This email is only for the use of its intended recipient. Its contents
>> are subject to a duty of confidence and may be privileged. Fujitsu does not
>> guarantee that this email has not been intercepted and amended or that it
>> is virus-free.
>>
>
>

Mime
View raw message