manifoldcf-user mailing list archives

From Standen Guy <Guy.Stan...@uk.fujitsu.com>
Subject RE: Advice on which PostgreSQL to use with ManifoldCF 2.6
Date Wed, 08 Mar 2017 16:12:51 GMT
Hi Karl,
                I have attached a file to CONNECTORS-1395 which shows an excerpt from the Zookeeper console log. Unfortunately I don't have much console history from before the event, so I may have lost some important information.

This Zookeeper issue happened in the middle of the night, so no one would have instigated it manually.

Best Regards,

Guy

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: 08 March 2017 13:45
To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy,

If nobody recycled Zookeeper intentionally at 00:26:00, is it possible that some automated process "cleaned up" Zookeeper temporary files out from under it at around that time? What we're seeing is a catastrophic failure of Zookeeper to retain anything beyond that point; I have no explanation for that behavior, and have never seen it before.

This could also potentially explain the apparent PostgreSQL transaction integrity issues, because in some cases we rely on local Zookeeper locks to prevent threads from interfering with one another.

Karl


On Wed, Mar 8, 2017 at 8:37 AM, Karl Wright <daddywri@gmail.com> wrote:
Right, sorry, I overlooked this attachment in your original mail. Have a look at the ticket for the updated status of the research, or at later posts in this thread.

Karl


On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy <Guy.Standen@uk.fujitsu.com> wrote:
Hi Karl,
Attached is the MCF trace that includes all the logging pertaining to the FATAL error (the last entry is at 00:36:56,798; there was no further logging until 08:00, when I looked at the jobs in the morning). This is the same trace I sent with the original mail; I'm afraid I have no more trace than this.

I'll try to reproduce the problem with forensic logging on and append the traces to CONNECTORS-1395.

Best Regards,

Guy

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: 08 March 2017 12:32

To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy,

I've now had a look at everything available to me on your issue. I've created CONNECTORS-1395 to track it; please attach any new materials to that ticket.

There's an exception trace you didn't include, specifically for this FATAL exception:

FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't release lock we don't hold

java.lang.IllegalStateException: Can't release lock we don't hold

This is very probably the smoking gun for the stuck lock.  I'd really like that exception
trace from the log if you wouldn't mind.

What has happened is this:

(1) The document fetch cycle runs like this: (a) "stuff" the document (by changing its state to "active"), (b) fetch the document, (c) mark the document completed. For a specific document, after it has been fetched, when we try to mark it "completed" we do not see the expected "active" status. Instead we see a status of "pending purgatory", which means that either some other thread changed the document's status out from under us, or the document's status was never in fact set to "active" during the stuffing phase. Neither of these should be possible given the code paths available, but we can prove it one way or the other by turning on "forensic" debugging, as I described earlier in this thread.

(2) Once the failure happens, the job in question should abort. But it seems that the abort does not complete, because there's a second problem with lock management (which generates the FATAL message above). This should be readily fixed if I can get that trace.
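
To make the state machine in (1) concrete, here is a rough sketch of the completion-phase check, with simplified, hypothetical names (the real logic lives in JobQueue.updateCompletedRecord, per the stack trace below):

>>>>>>
// Sketch only: simplified, hypothetical names, not the actual JobQueue code.
void markCompleted(long recordID, int statusSeen)
  throws ManifoldCFException
{
  // The stuffing phase is supposed to have set the record to "active".
  // Seeing "pending purgatory" here means either another thread changed
  // the record out from under us, or "active" was never committed.
  if (statusSeen != STATUS_ACTIVE)
    throw new ManifoldCFException("Unexpected jobqueue status - record id "
      + recordID + ", expecting active status, saw " + statusSeen);
  setStatus(recordID, STATUS_COMPLETE);  // normal path: mark completed
}
<<<<<<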

Thanks,
Karl

On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Guy,

The agents thread dump shows that there's a stuck lock from somewhere; I expect it's from the UI. Next time this happens, could you get a thread dump for the UI process as well as for the agents process? Thanks!!

Karl


On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Guy,

See https://issues.apache.org/jira/browse/CONNECTORS-590.

When you see "unexpected", that is not a good sign:

>>>>>>
ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
                at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
                at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
                at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:710)
<<<<<<

I've spent weeks chasing this issue in the past. I have not seen it at all except in certain installations: two of them, counting yours. I introduced transaction forensics into the MCF code to figure out what was going on, and I definitively proved (on PostgreSQL 9.1) that we were seeing a transactional integrity problem with PostgreSQL itself. I did not log a ticket against PostgreSQL because they would have needed a reproducible test case on the latest version, and the problem did not happen on 9.3 at the time.

You seem to be seeing a deadlock in MCF locks as well. When the PostgreSQL bug happens, it leaves the database in a state that MCF really can't figure out and thus doesn't know how to deal with, so the deadlock may just be a downstream result of that occurrence. But I can't be sure without further analysis.

I'm very curious now about the details of your setup. (1) How is your PostgreSQL set up? What version did you decide to use? (2) How many agents processes do you have? Just one? (3) What OS? (4) What JDK?

Other things you can try:
(1) Running the PostgreSQL LT tests (ant run-LT-postgresql) against your PostgreSQL installation; you will need to change the test code itself to allow it to create an instance for testing in that case.
(2) Turning on database transaction forensics (property name "org.apache.manifoldcf.diagnostics", value "DEBUG").
(3) Turning on lock debugging (property name "org.apache.manifoldcf.lock", value "DEBUG").

Transaction forensics uses memory but only generates lots of output when something goes wrong, so that may be the best place to start. Turning on lock debugging will generate a large amount of log output, but may be necessary, especially if we want to figure out whether the lock issue is a downstream problem caused by the database integrity issue.
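
For reference, (2) and (3) are ordinary entries in the properties.xml file of the MCF install; something like this (a sketch, with the two property names exactly as above):

>>>>>>
<!-- properties.xml: enable transaction forensics and lock debugging -->
<property name="org.apache.manifoldcf.diagnostics" value="DEBUG"/>
<property name="org.apache.manifoldcf.lock" value="DEBUG"/>
<<<<<<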


Thanks,
Karl


On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy <Guy.Standen@uk.fujitsu.com> wrote:
Hi Karl,
                I have upgraded to MCF 2.6 on Windows, using Zookeeper for synchronisation and PostgreSQL 9.3.16 for the database.
I re-ran the pair of jobs (one web crawl of a local intranet site every 15 minutes, and one crawl of a JDBC-connected database every 3 minutes) that would previously lock up after a couple of iterations. These seemed to run much better together, but after 6 hours of running both jobs locked up (both indicating they were starting up).

Attempting to restart or abort the jobs through the MCF webapp failed.

After restarting the MCF agents process, everything sprang into life and the jobs aborted/restarted as expected.

In the MCF log the following error was found:
“ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
                at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
                at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
                at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:710)
WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption reported for job 1488898090224
connection 'web': Job no longer active
WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption reported for job
1488898090224 connection 'web': Job no longer active
FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't release lock we
don't hold
java.lang.IllegalStateException: Can't release lock we don't hold”


ManifoldCF does not seem to have recovered from this error, and the jobs can't be restarted without restarting the ManifoldCF agents process.

The ManifoldCF, PostgreSQL, and agents JVM stack traces are attached.

Best Regards,

Guy

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: 06 March 2017 09:24
To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy,

(1) I have no experience with PostgreSQL versions beyond 9.3, but I doubt you would have problems.
(2) If you are using multiple processes, even if there's only one agents process, you must use synchronization. I would recommend Zookeeper; file-system-based synchronization is deprecated.
(3) Windows has many ways of interfering with file-based sync, including path-length issues. I have seen Windows fail to unlock files and need a reboot to release the lock. This is one reason why file-system-based locking is deprecated.
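
As a sketch of what the Zookeeper setup looks like in properties.xml (property names as in the multiprocess-zk example that ships with MCF; check that example against your version rather than taking this verbatim):

>>>>>>
<!-- Sketch: Zookeeper-based lock manager, per the multiprocess-zk example -->
<property name="org.apache.manifoldcf.lockmanagerclass"
          value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
<property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
<property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>
<<<<<<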

Thanks,
Karl


On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy <Guy.Standen@uk.fujitsu.com> wrote:
Hi Karl,
                Thanks for that; I will try version 2.6. Whilst moving to MCF 2.6 I would potentially like to upgrade my backend PostgreSQL version from 9.3.5.
1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6, e.g. PostgreSQL 9.3.16 or PostgreSQL 9.6.2?
2) For a production system on a single server running a single MCF agents process, would you recommend file-based synchronisation locking or Zookeeper-based synchronisation locking? With the file-based synchronisation locking mechanism I have sometimes seen errors of the form:
                'D:\Apps\ManifoldCF\apache-manifoldcf-2.0.1\multiprocess-file-example\.\.\syncharea\475\708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr COLL1 osp_unstruct.lock' failed: Access is denied
(I have ensured that the syncharea directory is not scanned by AV or indexed by Windows Search, and that all MCF processes run as the same user.)
What could cause these errors?

Many Thanks,

Guy

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: 03 March 2017 17:35

To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy:

It is expected that database deadlocks will sometimes develop, and that the transaction will need to be retried. There is code in MCF for doing this:

>>>>>>
    if (sqlState != null && sqlState.equals("40001"))
      return new ManifoldCFException(message,e,ManifoldCFException.DATABASE_TRANSACTION_ABORT);
<<<<<<

I suspect that your version of MCF is old enough that this particular error mapping and the associated retry are not taking place. Upgrading to 2.6 will definitely help there.
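
The retry pattern around that mapping looks roughly like this (a simplified sketch with hypothetical method names, not the actual MCF transaction code):

>>>>>>
// Sketch: a 40001 serialization failure becomes DATABASE_TRANSACTION_ABORT,
// which causes the whole transaction to be retried instead of failing the job.
while (true)
{
  database.beginTransaction();
  try
  {
    doTransactionalWork();       // hypothetical: the queries that may conflict
    database.endTransaction();   // commit
    break;                       // success; leave the retry loop
  }
  catch (ManifoldCFException e)
  {
    database.signalRollback();
    database.endTransaction();
    if (e.getErrorCode() != ManifoldCFException.DATABASE_TRANSACTION_ABORT)
      throw e;                   // a real error; propagate it
    // otherwise: serialization conflict, loop around and retry
  }
}
<<<<<<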

Karl


On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy <Guy.Standen@uk.fujitsu.com> wrote:
Hi Karl,
                Thanks for coming back so quickly. Unfortunately I wasn't using a JCIFS connection. One of the issues I was seeing was between a crawl of an intranet site (no explicit throttling other than the number of connections) and a scheduled crawl (every 5 minutes) of a relational DB via the JDBC connector, again with no explicit throttling. To simplify things, both jobs are using a NULL output connection. Sometimes both the web crawl and the JDBC crawl can run together, but at other times one or both jobs will appear to lock up with just a few active documents showing. When I get a lock-up the MCF log contains errors like:

“DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database exception: SQLException doing query (40001): ERROR: could not serialize access due to read/write dependencies among transactions”

See the attached log extract for a little more detail.

Any view on why this might be happening?

Best Regards,

Guy

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: 03 March 2017 11:27
To: user@manifoldcf.apache.org
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Hi Guy,

An issue with concurrent jobs is known for jobs sharing the same JCIFS connection. Is that what you are using? This has nothing to do with the version of PostgreSQL you are using; it has to do with which "bins" documents are thought to come from. There has been a recent improvement for this issue, which will be released in April. See https://issues.apache.org/jira/browse/CONNECTORS-1364.

The current version of MCF (2.6) supports Solr 6.x.

Thanks,
Karl


On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy <Guy.Standen@uk.fujitsu.com> wrote:
Hi Karl,
I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows, and have had some issues with multiple jobs running concurrently.
I am considering upgrading to MCF 2.6 and to a newer version of PostgreSQL. Would you be able to advise which version of PostgreSQL I should consider using with MCF 2.6 (e.g. PostgreSQL 9.3.16 or 9.6.2)?

I am also considering upgrading from Solr 4.10.3 to a newer version. The MCF compatibility matrix mentions that compatibility has been tested up to Solr 4.5.1. Do you have any advice about compatibility with newer versions of Solr, e.g. 6.4.1?

Best Regards

Guy

