"Another observation from the document status - when they get status "waiting for processing" their "Scheduled time" are in the future - "job restart time + 1 day". That could be the reason, why the job does not exit - there are still active document, but not yet ready for processing."

This happens ONLY because the document is getting requeued.  A document is requeued ONLY because of (a) errors, (b) the job is running in "continuous" mode.

The 24 hour period is suspicious because that's the default recrawl time in continuous mode.  Are you running the job in continuous mode?

Karl


On Wed, Jul 20, 2016 at 4:56 PM, jetnet <jetnet@gmail.com> wrote:
We use a Postgres instance, the crawler job is a windows share job. There are 3 windows file share job using the same repository. And all of them have the same issue.
So, I've stopped 2 jobs and currently debugging the last one.
Another observation from the document status - when they get status "waiting for processing" their "Scheduled time" are in the future - "job restart time + 1 day". That could be the reason, why the job does not exit - there are still active document, but not yet ready for processing.

Thanks!


2016-07-20 22:08 GMT+02:00 Karl Wright <daddywri@gmail.com>:
Hi,

Can you provide more details about your setup, etc.?  Specifically:

- what database is this?
- how many jobs are there?
- are there any jobs that have lots of documents that are in an unusual state?

The query that the stuffer thread runs is not magic, and if it returns only a couple of documents every now and then there must be a reason.  I do not see the kind of behavior you are describing here.

Karl


On Wed, Jul 20, 2016 at 3:35 PM, jetnet <jetnet@gmail.com> wrote:
hi Karl,

actually, there are no failed documents at all. All documents (without exception) have status "Waiting for Processing" and the job is hanging around with status "Running", but nothing happens.
When the job gets restarted, then all docs become status "About to Process", and then slowly (2 docs / 2 seconds) moving to "Waiting for Processing" again.
I could delete the job and re-create it again, but I'd like to understand the root cause of the issue.

Any hints where I could look else very much appreciated!
Thanks!

2016-07-19 15:31 GMT+02:00 Karl Wright <daddywri@gmail.com>:
Hi Konstantin,

It's not possible to figure out from the log why the stuffer thread queued only two documents at that time.  Probably it's because there were no other eligible documents.  Documents in the queue have a priority and they also have a minimum time they can be acted upon.  If you have documents that failed earlier, they get requeued depending on the particular kind of failure (up to the connector) with specific times.  There's no way to make the queue resolve "faster" if it's full of documents that were requeued in this way due to errors.

You can figure all this out by using the Document Status and Queue Status reports.

Karl


On Tue, Jul 19, 2016 at 7:29 AM, jetnet <jetnet@gmail.com> wrote:
Hi Karl,

the documentation states: "it [Document Stuffer Thread] stuffs some number that is a
multiple of the number of Worker Threads at one time, up to some limit (which is also
related to the number of Worker Threads)."

here is the worker threads config:

  <property name="org.apache.manifoldcf.database.maxhandles" value="200"/>
  <property name="org.apache.manifoldcf.crawler.threads" value="100"/>



So, I'd expect, that the stuffer thread would put about 200 docs in the queue, but it puts only 2...

DEBUG 2016-07-19 13:24:02,916 (Stuffer thread) - Document stuffer thread woke up
DEBUG 2016-07-19 13:24:02,963 (Stuffer thread) - Stuffing document 'smb://...' that has priority 10.284489143231797 onto active list
DEBUG 2016-07-19 13:24:02,963 (Stuffer thread) - Stuffing document 'smb://...' that has priority 10.28520622168017 onto active list
DEBUG 2016-07-19 13:24:02,963 (Set priority thread) - Reprioritizing 1 documents
DEBUG 2016-07-19 13:24:02,979 (Set priority thread) - Reprioritizing 1 documents
DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Setting new minimum depth value to 29273.99867577113
DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Stuffer thread: Found 2 documents to queue
DEBUG 2016-07-19 13:24:02,979 (Set priority thread) - Reprioritizing 1 documents
DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Putting document 'smb://...' with bins [SERVER ] onto active queue
DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Putting document 'smb://...' with bins [SERVER ] onto active queue
DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Document stuffer thread woke up


Or did I miss something?

Thanks!


2016-07-19 12:51 GMT+02:00 Karl Wright <daddywri@gmail.com>:
Hi Konstantin,

The stuffer thread operates independently of the worker threads.  It wakes up, and if anything is available, stuffs as much as it can, and when done sleeps for a short time.  The queue is maintained at at least twice the number of active worker threads.  See ManifoldCF in Action.

Karl


On Tue, Jul 19, 2016 at 6:33 AM, jetnet <jetnet@gmail.com> wrote:
hi All,

I've encountered recently an issue with the crawler (JCIFS connector): when a jobs gets started, all it's documents are being checked, and this process is taking too long. After tuning DEBUG on, I found, that there are ~2 seconds delay when processing the document queue:

e.g.:

DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - JCIFS: Leaving wouldFileBeIncluded for 'smb://...
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Worker thread done processing 1 documents
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Adding 1453999232278 to finishList
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Adding 1453999232278 to ingesterCheckList
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Finishing documents {1453999232278 }
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Requeueing documents due to carrydown {}
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Requeuing {1453999232278 }
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Deleting {}
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Hopcount removal {}
DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Rescanning documents {}
DEBUG 2016-07-19 10:31:06,500 (Stuffer thread) - Stuffer thread: Found 0 documents to queue
DEBUG 2016-07-19 10:31:06,750 (Document cleanup stuffer thread) - Document cleanup stuffer thread woke up
DEBUG 2016-07-19 10:31:06,750 (Document delete stuffer thread) - Document delete stuffer thread woke up
DEBUG 2016-07-19 10:31:06,750 (Document cleanup stuffer thread) - Document cleanup stuffer thread found nothing to do
DEBUG 2016-07-19 10:31:06,750 (Document delete stuffer thread) - Document delete stuffer thread found nothing to do
DEBUG 2016-07-19 10:31:07,375 (Set priority thread) - Done reprioritizing because no more documents to reprioritize
DEBUG 2016-07-19 10:31:07,750 (Document cleanup stuffer thread) - Document cleanup stuffer thread woke up
DEBUG 2016-07-19 10:31:07,750 (Document delete stuffer thread) - Document delete stuffer thread woke up
DEBUG 2016-07-19 10:31:07,750 (Document delete stuffer thread) - Document delete stuffer thread found nothing to do
DEBUG 2016-07-19 10:31:07,750 (Document cleanup stuffer thread) - Document cleanup stuffer thread found nothing to do
DEBUG 2016-07-19 10:31:08,500 (Stuffer thread) - Document stuffer thread woke up
DEBUG 2016-07-19 10:31:08,516 (Stuffer thread) - Stuffer thread: Found 2 documents to queue
DEBUG 2016-07-19 10:31:08,531 (Stuffer thread) - Document stuffer thread woke up
DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker thread processing documents: 1453999191642
DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker thread processing documents: 1453999188326
DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker thread starting document count is 1
DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker thread starting document count is 1
DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Post-relationship document count is 1
DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Post-relationship document count is 1
DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') -  Post-hopcount pruned document count is 1
DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') -  Post-hopcount pruned document count is 1
DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker thread about to process {1453999191642 }
DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker thread about to process {1453999188326 }
DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - JCIFS: Processing 'smb://...


As one can see, there is a delay between the thread-61 has left the checks and next 2 threads have started.
Any idea why that happens? Why the document stuffer thread does not wake up immediately? Or - why the "document queue batch size" is only 2?

P.S. MCF version 2.3

Thanks!

--
Konstantin