manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jetnet <jet...@gmail.com>
Subject Re: 2 seconds delay when checking documents
Date Thu, 21 Jul 2016 07:53:46 GMT
hi Karl,

it's a shame, but I must say - the problem was sitting in front of the
monitor :)
I was sure, the jobs were in non-continuous mode, and didn't look at this,
but the scheduler's setting was: "rescan dynamically".
Since the bin's throttling was set to 60 docs/min, the stuffer thread was
doing everything right - it woke up every 2 sec and put 2 docs into the
memory queue. Please correct me if I'm wrong.
The world is OK again :)
Thank you very much for helping with this question.
--
Konstantin


2016-07-21 0:18 GMT+02:00 Karl Wright <daddywri@gmail.com>:

> "Another observation from the document status - when they get status
> "waiting for processing" their "Scheduled time" are in the future - "job
> restart time + 1 day". That could be the reason, why the job does not exit
> - there are still active document, but not yet ready for processing."
>
> This happens ONLY because the document is getting requeued.  A document is
> requeued ONLY because of (a) errors, (b) the job is running in "continuous"
> mode.
>
> The 24 hour period is suspicious because that's the default recrawl time
> in continuous mode.  Are you running the job in continuous mode?
>
> Karl
>
>
> On Wed, Jul 20, 2016 at 4:56 PM, jetnet <jetnet@gmail.com> wrote:
>
>> We use a Postgres instance, the crawler job is a windows share job. There
>> are 3 windows file share job using the same repository. And all of them
>> have the same issue.
>> So, I've stopped 2 jobs and currently debugging the last one.
>> Another observation from the document status - when they get status
>> "waiting for processing" their "Scheduled time" are in the future - "job
>> restart time + 1 day". That could be the reason, why the job does not exit
>> - there are still active document, but not yet ready for processing.
>>
>> Thanks!
>>
>>
>> 2016-07-20 22:08 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>
>>> Hi,
>>>
>>> Can you provide more details about your setup, etc.?  Specifically:
>>>
>>> - what database is this?
>>> - how many jobs are there?
>>> - are there any jobs that have lots of documents that are in an unusual
>>> state?
>>>
>>> The query that the stuffer thread runs is not magic, and if it returns
>>> only a couple of documents every now and then there must be a reason.  I do
>>> not see the kind of behavior you are describing here.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Jul 20, 2016 at 3:35 PM, jetnet <jetnet@gmail.com> wrote:
>>>
>>>> hi Karl,
>>>>
>>>> actually, there are no failed documents at all. All documents (without
>>>> exception) have status "Waiting for Processing" and the job is hanging
>>>> around with status "Running", but nothing happens.
>>>> When the job gets restarted, then all docs become status "About to
>>>> Process", and then slowly (2 docs / 2 seconds) moving to "Waiting for
>>>> Processing" again.
>>>> I could delete the job and re-create it again, but I'd like to
>>>> understand the root cause of the issue.
>>>>
>>>> Any hints where I could look else very much appreciated!
>>>> Thanks!
>>>>
>>>> 2016-07-19 15:31 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>
>>>>> Hi Konstantin,
>>>>>
>>>>> It's not possible to figure out from the log why the stuffer thread
>>>>> queued only two documents at that time.  Probably it's because there
were
>>>>> no other eligible documents.  Documents in the queue have a priority
and
>>>>> they also have a minimum time they can be acted upon.  If you have
>>>>> documents that failed earlier, they get requeued depending on the
>>>>> particular kind of failure (up to the connector) with specific times.
>>>>> There's no way to make the queue resolve "faster" if it's full of documents
>>>>> that were requeued in this way due to errors.
>>>>>
>>>>> You can figure all this out by using the Document Status and Queue
>>>>> Status reports.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Jul 19, 2016 at 7:29 AM, jetnet <jetnet@gmail.com> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> the documentation states: "it [Document Stuffer Thread] stuffs some
>>>>>> number that is a
>>>>>> multiple of the number of Worker Threads at one time, up to some
>>>>>> limit (which is also
>>>>>> related to the number of Worker Threads)."
>>>>>>
>>>>>> here is the worker threads config:
>>>>>>
>>>>>>   <property name="org.apache.manifoldcf.database.maxhandles"
>>>>>> value="200"/>
>>>>>>   <property name="org.apache.manifoldcf.crawler.threads" value="100"/>
>>>>>>
>>>>>>
>>>>>> So, I'd expect, that the stuffer thread would put about 200 docs
in
>>>>>> the queue, but it puts only 2...
>>>>>>
>>>>>> DEBUG 2016-07-19 13:24:02,916 (Stuffer thread) - Document stuffer
>>>>>> thread woke up
>>>>>> DEBUG 2016-07-19 13:24:02,963 (Stuffer thread) - Stuffing document
>>>>>> 'smb://...' that has priority 10.284489143231797 onto active list
>>>>>> DEBUG 2016-07-19 13:24:02,963 (Stuffer thread) - Stuffing document
>>>>>> 'smb://...' that has priority 10.28520622168017 onto active list
>>>>>> DEBUG 2016-07-19 13:24:02,963 (Set priority thread) - Reprioritizing
>>>>>> 1 documents
>>>>>> DEBUG 2016-07-19 13:24:02,979 (Set priority thread) - Reprioritizing
>>>>>> 1 documents
>>>>>> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Setting new minimum
>>>>>> depth value to 29273.99867577113
>>>>>> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Stuffer thread:
>>>>>> Found 2 documents to queue
>>>>>> DEBUG 2016-07-19 13:24:02,979 (Set priority thread) - Reprioritizing
>>>>>> 1 documents
>>>>>> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Putting document
>>>>>> 'smb://...' with bins [SERVER ] onto active queue
>>>>>> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Putting document
>>>>>> 'smb://...' with bins [SERVER ] onto active queue
>>>>>> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Document stuffer
>>>>>> thread woke up
>>>>>>
>>>>>> Or did I miss something?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> 2016-07-19 12:51 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>>>>>>
>>>>>>> Hi Konstantin,
>>>>>>>
>>>>>>> The stuffer thread operates independently of the worker threads.
 It
>>>>>>> wakes up, and if anything is available, stuffs as much as it
can, and when
>>>>>>> done sleeps for a short time.  The queue is maintained at at
least twice
>>>>>>> the number of active worker threads.  See ManifoldCF in Action.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 19, 2016 at 6:33 AM, jetnet <jetnet@gmail.com>
wrote:
>>>>>>>
>>>>>>>> hi All,
>>>>>>>>
>>>>>>>> I've encountered recently an issue with the crawler (JCIFS
>>>>>>>> connector): when a jobs gets started, all it's documents
are being checked,
>>>>>>>> and this process is taking too long. After tuning DEBUG on,
I found, that
>>>>>>>> there are ~2 seconds delay when processing the document queue:
>>>>>>>>
>>>>>>>> e.g.:
>>>>>>>>
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - JCIFS:
Leaving
>>>>>>>> wouldFileBeIncluded for 'smb://...
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Worker
thread
>>>>>>>> done processing 1 documents
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Adding
>>>>>>>> 1453999232278 to finishList
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Adding
>>>>>>>> 1453999232278 to ingesterCheckList
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Finishing
>>>>>>>> documents {1453999232278 }
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Requeueing
>>>>>>>> documents due to carrydown {}
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Requeuing
>>>>>>>> {1453999232278 }
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Deleting
{}
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Hopcount
>>>>>>>> removal {}
>>>>>>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Rescanning
>>>>>>>> documents {}
>>>>>>>> DEBUG 2016-07-19 10:31:06,500 (Stuffer thread) - Stuffer
thread:
>>>>>>>> Found 0 documents to queue
>>>>>>>> DEBUG 2016-07-19 10:31:06,750 (Document cleanup stuffer thread)
-
>>>>>>>> Document cleanup stuffer thread woke up
>>>>>>>> DEBUG 2016-07-19 10:31:06,750 (Document delete stuffer thread)
-
>>>>>>>> Document delete stuffer thread woke up
>>>>>>>> DEBUG 2016-07-19 10:31:06,750 (Document cleanup stuffer thread)
-
>>>>>>>> Document cleanup stuffer thread found nothing to do
>>>>>>>> DEBUG 2016-07-19 10:31:06,750 (Document delete stuffer thread)
-
>>>>>>>> Document delete stuffer thread found nothing to do
>>>>>>>> DEBUG 2016-07-19 10:31:07,375 (Set priority thread) - Done
>>>>>>>> reprioritizing because no more documents to reprioritize
>>>>>>>> DEBUG 2016-07-19 10:31:07,750 (Document cleanup stuffer thread)
-
>>>>>>>> Document cleanup stuffer thread woke up
>>>>>>>> DEBUG 2016-07-19 10:31:07,750 (Document delete stuffer thread)
-
>>>>>>>> Document delete stuffer thread woke up
>>>>>>>> DEBUG 2016-07-19 10:31:07,750 (Document delete stuffer thread)
-
>>>>>>>> Document delete stuffer thread found nothing to do
>>>>>>>> DEBUG 2016-07-19 10:31:07,750 (Document cleanup stuffer thread)
-
>>>>>>>> Document cleanup stuffer thread found nothing to do
>>>>>>>> DEBUG 2016-07-19 10:31:08,500 (Stuffer thread) - Document
stuffer
>>>>>>>> thread woke up
>>>>>>>> DEBUG 2016-07-19 10:31:08,516 (Stuffer thread) - Stuffer
thread:
>>>>>>>> Found 2 documents to queue
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Stuffer thread) - Document
stuffer
>>>>>>>> thread woke up
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker
thread
>>>>>>>> processing documents: 1453999191642
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker
thread
>>>>>>>> processing documents: 1453999188326
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker
thread
>>>>>>>> starting document count is 1
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker
thread
>>>>>>>> starting document count is 1
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') -
>>>>>>>> Post-relationship document count is 1
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') -
>>>>>>>> Post-relationship document count is 1
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') -  Post-hopcount
>>>>>>>> pruned document count is 1
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') -  Post-hopcount
>>>>>>>> pruned document count is 1
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker
thread
>>>>>>>> about to process {1453999191642 }
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker
thread
>>>>>>>> about to process {1453999188326 }
>>>>>>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - JCIFS:
>>>>>>>> Processing 'smb://...
>>>>>>>>
>>>>>>>>
>>>>>>>> As one can see, there is a delay between the thread-61 has
left the
>>>>>>>> checks and next 2 threads have started.
>>>>>>>> Any idea why that happens? Why the document stuffer thread
does not
>>>>>>>> wake up immediately? Or - why the "document queue batch size"
is only 2?
>>>>>>>>
>>>>>>>> P.S. MCF version 2.3
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message