manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Error: Unexpected jobqueue status - record id X, expecting active status, saw 4 (MySQL compatible Database)
Date Sat, 08 Jun 2019 05:54:51 GMT
And yes, we'd also need to hand the mysql folks a similar test case.
Karl


On Sat, Jun 8, 2019 at 1:53 AM Karl Wright <daddywri@gmail.com> wrote:

> Here's an explanation for Postgresql about what is supposed to happen in
> this case.  See slide 7.
>
> https://www.postgresql.org/files/developer/concurrency.pdf
>
> The explanation should apply to all MVCC databases as well.  The sequence
> we're seeing is:
>
> Thread 1:
> <start transaction>
> <do_expensive_stuffer_query>
> <update_jobqueue_status>
> <end transaction>
> <hand records to worker threads>
>
>
> Worker thread:
> <process document>
> <begin transaction>
> <change jobqueue status from active to completed>
> <end transaction
>
> The worker thread, however, is finding that the jobqueue status is not
> what was expected.  It's as if transactions didn't actually serialize, once
> in a great while, and either the original stuffer update never became
> visible by the time the worker thread completed its work, or some other
> thread modified it while the worker thread was processing the document.
> But I've got every jobqueue job status write instrumented so if that were
> what was happening we'd see it in the diagnostic dump, and we do not.
>
> I wish it were reasonably possible to hand the postgresql folks a test
> case that demonstrated this behavior.  But, as you know, the problem is
> quite rare, and you have to pound on the database for an extended period of
> time before you get this to happen. :-(
>
> Karl
>
> On Fri, Jun 7, 2019 at 3:28 AM Karl Wright <daddywri@gmail.com> wrote:
>
>> I just reviewed the places where JobQueue status is updated to be sure
>> that all places this happens are included in the diagnostics trackiing.
>> They were, save for when a cluster gets restarted, and I added that to the
>> code in trunk as well.
>>
>> So the analysis, above, stands.
>>
>> :-(
>>
>> Karl
>>
>>
>> On Thu, Jun 6, 2019 at 4:28 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> I just looked at this briefly.
>>>
>>> For
>>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204,
>>> look for the following record ID:
>>>
>>> 1522147023170
>>>
>>> This is the one when we look at it after the document is processed that it has
status 4:
>>>
>>> >>>>>>
>>>
>>> manifoldcf_1     | 2019-06-06T02:01:05.380401275Z 2019-06-06T04:01:05,380 DEBUG
[Worker thread '37'] org.apache.manifoldcf.diagnostics: ==== Forensics for record 1522147023170,
current status: 4 ====
>>>
>>> manifoldcf_1     | 2019-06-06T02:01:05.380512360Z 2019-06-06T04:01:05,380 DEBUG
[Worker thread '37'] org.apache.manifoldcf.diagnostics: === Current stack trace ===
>>> manifoldcf_1     | 2019-06-06T02:01:05.380544192Z java.lang.Exception: Forensics
stack trace
>>> manifoldcf_1     | 2019-06-06T02:01:05.380548494Z 	at org.apache.manifoldcf.crawler.jobs.TrackerClass.printForensics(TrackerClass.java:216)
[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.380552564Z 	at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1028)
[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.380568210Z 	at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3414)
[mcf-pull-agent.jar:?]
>>>
>>> <<<<<<
>>>
>>> It expects it to have status 5, though, because the only way the document could
be being worked on at all is if the document had been put into status 5 by the stuffer thread:
>>>
>>> >>>>>>
>>>
>>> manifoldcf_1     | 2019-06-06T02:01:05.387210733Z 2019-06-06T04:01:05,381 DEBUG
[Worker thread '37'] org.apache.manifoldcf.diagnostics: == About to commit transaction by
'Stuffer thread' at 1559786464972 ==
>>> manifoldcf_1     | 2019-06-06T02:01:05.387250827Z java.lang.Exception: Precommit
stack trace
>>> manifoldcf_1     | 2019-06-06T02:01:05.387256464Z 	at org.apache.manifoldcf.crawler.jobs.TrackerClass.notePrecommit(TrackerClass.java:108)
~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387260987Z 	at org.apache.manifoldcf.crawler.jobs.JobManager.fetchAndProcessDocuments(JobManager.java:3295)
~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387265034Z 	at org.apache.manifoldcf.crawler.jobs.JobManager.getNextDocuments(JobManager.java:3014)
~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387269073Z 	at org.apache.manifoldcf.crawler.system.StufferThread.run(StufferThread.java:186)
~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387285663Z 2019-06-06T04:01:05,387 DEBUG
[Worker thread '37'] org.apache.manifoldcf.diagnostics:     Transaction includes:
>>> manifoldcf_1     | 2019-06-06T02:01:05.387290117Z 2019-06-06T04:01:05,387 DEBUG
[Worker thread '37'] org.apache.manifoldcf.diagnostics:       Record 1522147023170 status
modified to 5: Make active
>>>
>>> ...
>>>
>>> <<<<<<
>>>
>>> The actual commit follows a short while later to actually set the status, and
only AFTER that are the documents queued to be handed to the worker threads.
>>>
>>> So what this is saying is that the stuffer thread sets the document status to
5 and then a worker thread gets it and when the worker thread is done it tries to set the
status back to 4 -- but it's ALREADY at 4 and that is not possible.
>>>
>>> Now, it may be worth examining the possibility that some other job gets to the
document while it is waiting to be queued, and does stuff to the document's jobqueue record.
 There are a couple of background threads which (for example) reprioritize documents that
might conceivably change the jobqueue record's status.  They generally work by changing the
record status in a transaction to some other value that would not make the record eligible
for queuing, and then changing it back when done.  But unless completely overlooked, these
would, however, be visible in the diagnostics, and furthermore it should not be possible for
one thread in a transaction to change the status at the same time as another thread in a transaction
changes the status.
>>>
>>> So either way I'm forced to conclude that a record gets changed and committed,
but the change is not visible to another thread later trying to reset the status.
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>> On Thu, Jun 6, 2019 at 5:58 AM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I can look at the log output but not until this weekend.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Jun 6, 2019 at 3:59 AM Markus Schuch <
>>>> markus.schuch@deutschebahn.com> wrote:
>>>>
>>>>> Hi Olivier,
>>>>>
>>>>>
>>>>>
>>>>> we were not able to fix this yet.
>>>>>
>>>>>
>>>>>
>>>>> But now we have new diagnostics log data, after the error occurred
>>>>> again yesterday:
>>>>>
>>>>>
>>>>>
>>>>> Unexpected jobqueue status - record id 1522147023170, expecting active
>>>>> status, saw 4
>>>>>
>>>>>
>>>>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204
>>>>>
>>>>>
>>>>>
>>>>> Unexpected jobqueue status - record id 1541529864711, expecting active
>>>>> status, saw 4
>>>>>
>>>>>
>>>>> https://gist.githubusercontent.com/schuch/d73672e44b0edd5a9184170c9ee417a5/raw/f55b1c023056ef8e310d99489a0b6f9dfcaaa585/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201541529864711,%2520expecting%2520active%2520status,%2520saw%25204
>>>>>
>>>>>
>>>>>
>>>>> @Karl, maybe you can get some ideas from that log data?
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>>
>>>>> *Von:* Olivier Tavard <olivier.tavard@francelabs.com>
>>>>> *Gesendet:* Dienstag, 21. Mai 2019 11:00
>>>>> *An:* user@manifoldcf.apache.org
>>>>> *Betreff:* Re: Error: Unexpected jobqueue status - record id X,
>>>>> expecting active status, saw 4 (MySQL compatible Database)
>>>>>
>>>>>
>>>>>
>>>>> Hi Markus,
>>>>>
>>>>>
>>>>>
>>>>> We have the same error (with postgresql database). Did the error occur
>>>>> again since your last mail ?
>>>>>
>>>>> Did you change something on your MCF configuration to fix this ?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>>
>>>>> Olivier
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Le 13 févr. 2019 à 13:58, Markus Schuch <
>>>>> markus.schuch@deutschebahn.com> a écrit :
>>>>>
>>>>>
>>>>>
>>>>> Hi Karl,
>>>>>
>>>>>
>>>>>
>>>>> we set the diagnostigs logger to level debug.
>>>>>
>>>>>
>>>>>
>>>>> I will get back when the error occurs again.
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>>
>>>>> Pflichtangaben anzeigen
>>>>> <http://www.deutschebahn.com/pflichtangaben/20190211>
>>>>>
>>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie
>>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz
>>>>>
>>>>>
>>>>>
>>>>> Le 12 févr. 2019 à 17:41, Karl Wright <daddywri@gmail.com> a
écrit :
>>>>>
>>>>>
>>>>>
>>>>> Hi Marcus,
>>>>>
>>>>>
>>>>>
>>>>> There's a properties.xml debugging logger you can enable that will
>>>>> keep track of what's happening with transactions, so that when an error
of
>>>>> this kinds is reported, information about why the situation is unexpected
>>>>> is dumped to the log.  The logger is called "diagnostics" e.g.
>>>>> "org.apache.manifoldcf.diagnostics".
>>>>>
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 12, 2019 at 10:53 AM Markus Schuch <
>>>>> markus.schuch@deutschebahn.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> we are seeing "Error: Unexpected jobqueue status - record id
>>>>> 1484612513829, expecting active status, saw 4" from time to time.
>>>>> I didn’t report it, because we were running on an old MCF version,
and
>>>>> there were some bugreports relating to this error that are resolved in
>>>>> newer versions.
>>>>>
>>>>> Now we have upgraded to latest and greatest and the error still
>>>>> occurred. So want to start to track this down.
>>>>>
>>>>> Out setup is:
>>>>>
>>>>>    - ManifoldCF 2.12, running in a Docker Container based on Redhat
>>>>>    Linux, OpenJDK 8
>>>>>    - AWS RDS Database (Aurora MySQL -> 5.6 compatible)
>>>>>    - Single Process Setup
>>>>>
>>>>>
>>>>> We run several Jobs (run once mode) over night with schedule windows
>>>>> and max runtime settings. Some of the time windows are overlapping.
>>>>> At normal days some Jobs finish during their time window, some go to
>>>>> WAITING and will go on in the next night.
>>>>>
>>>>> The error mostly hits at a Sharepoint Repository Job.
>>>>> We have the impression, that the error is somehow related to
>>>>> situations, when service interruptions (e.g. connection issues) occur
in
>>>>> other jobs.
>>>>>
>>>>>
>>>>>
>>>>> Maybe all this is related to fundamental problems with the database or
>>>>> the JDBC driver
>>>>>
>>>>> (Karl expressed his concerns about that in the maybe-related
>>>>> https://issues.apache.org/jira/browse/CONNECTORS-1581 ticket)
>>>>>
>>>>> On the other hand, there were similar bugreports in the past
>>>>> concerning unexpected jobqueue status and there were resolved.
>>>>>
>>>>> Mayby there is a chance this can also be analyzed and fixed, but I do
>>>>> not really know where to start.
>>>>>
>>>>>
>>>>> Has anybody ideas how we can track this down / debug this to get to
>>>>> the bottom of the problem?
>>>>> What could be the first step in the analysis?
>>>>>
>>>>> Many thanks in advance
>>>>> Markus
>>>>>
>>>>> May be related bugreports:
>>>>>
>>>>>
>>>>>
>>>>> [0] https://issues.apache.org/jira/browse/CONNECTORS-1395
>>>>> [1] https://issues.apache.org/jira/browse/CONNECTORS-1180
>>>>> [2] https://issues.apache.org/jira/browse/CONNECTORS-590
>>>>> [3] https://issues.apache.org/jira/browse/CONNECTORS-246
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Markus Schuch
>>>>>
>>>>> Web Business (T.IPB 26)
>>>>>
>>>>>
>>>>>
>>>>> DB Systel GmbH
>>>>>
>>>>> Jürgen-Ponto-Platz 1, 60329 Frankfurt a. Main
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>>
>>>>> Pflichtangaben anzeigen
>>>>> <http://www.deutschebahn.com/pflichtangaben/20190211>
>>>>>
>>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie
>>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Pflichtangaben anzeigen
>>>>> <http://www.deutschebahn.com/pflichtangaben/20190529>
>>>>>
>>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie
>>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz
>>>>>
>>>>

Mime
View raw message