manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: 2 seconds delay when checking documents
Date Tue, 19 Jul 2016 13:31:31 GMT
Hi Konstantin,

It's not possible to figure out from the log why the stuffer thread queued
only two documents at that time.  Probably it's because there were no other
eligible documents.  Documents in the queue have a priority and they also
have a minimum time they can be acted upon.  If you have documents that
failed earlier, they get requeued depending on the particular kind of
failure (up to the connector) with specific times.  There's no way to make
the queue resolve "faster" if it's full of documents that were requeued in
this way due to errors.

You can figure all this out by using the Document Status and Queue Status
reports.

Karl


On Tue, Jul 19, 2016 at 7:29 AM, jetnet <jetnet@gmail.com> wrote:

> Hi Karl,
>
> the documentation states: "it [Document Stuffer Thread] stuffs some number
> that is a
> multiple of the number of Worker Threads at one time, up to some limit
> (which is also
> related to the number of Worker Threads)."
>
> here is the worker threads config:
>
>   <property name="org.apache.manifoldcf.database.maxhandles" value="200"/>
>   <property name="org.apache.manifoldcf.crawler.threads" value="100"/>
>
>
> So, I'd expect, that the stuffer thread would put about 200 docs in the
> queue, but it puts only 2...
>
> DEBUG 2016-07-19 13:24:02,916 (Stuffer thread) - Document stuffer thread
> woke up
> DEBUG 2016-07-19 13:24:02,963 (Stuffer thread) - Stuffing document
> 'smb://...' that has priority 10.284489143231797 onto active list
> DEBUG 2016-07-19 13:24:02,963 (Stuffer thread) - Stuffing document
> 'smb://...' that has priority 10.28520622168017 onto active list
> DEBUG 2016-07-19 13:24:02,963 (Set priority thread) - Reprioritizing 1
> documents
> DEBUG 2016-07-19 13:24:02,979 (Set priority thread) - Reprioritizing 1
> documents
> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Setting new minimum depth
> value to 29273.99867577113
> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Stuffer thread: Found 2
> documents to queue
> DEBUG 2016-07-19 13:24:02,979 (Set priority thread) - Reprioritizing 1
> documents
> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Putting document
> 'smb://...' with bins [SERVER ] onto active queue
> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Putting document
> 'smb://...' with bins [SERVER ] onto active queue
> DEBUG 2016-07-19 13:24:02,979 (Stuffer thread) - Document stuffer thread
> woke up
>
> Or did I miss something?
>
> Thanks!
>
>
> 2016-07-19 12:51 GMT+02:00 Karl Wright <daddywri@gmail.com>:
>
>> Hi Konstantin,
>>
>> The stuffer thread operates independently of the worker threads.  It
>> wakes up, and if anything is available, stuffs as much as it can, and when
>> done sleeps for a short time.  The queue is maintained at at least twice
>> the number of active worker threads.  See ManifoldCF in Action.
>>
>> Karl
>>
>>
>> On Tue, Jul 19, 2016 at 6:33 AM, jetnet <jetnet@gmail.com> wrote:
>>
>>> hi All,
>>>
>>> I've encountered recently an issue with the crawler (JCIFS connector):
>>> when a jobs gets started, all it's documents are being checked, and this
>>> process is taking too long. After tuning DEBUG on, I found, that there are
>>> ~2 seconds delay when processing the document queue:
>>>
>>> e.g.:
>>>
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - JCIFS: Leaving
>>> wouldFileBeIncluded for 'smb://...
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Worker thread done
>>> processing 1 documents
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Adding
>>> 1453999232278 to finishList
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Adding
>>> 1453999232278 to ingesterCheckList
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Finishing documents
>>> {1453999232278 }
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Requeueing
>>> documents due to carrydown {}
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') -  Requeuing
>>> {1453999232278 }
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Deleting {}
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Hopcount removal {}
>>> DEBUG 2016-07-19 10:31:06,484 (Worker thread '61') - Rescanning
>>> documents {}
>>> DEBUG 2016-07-19 10:31:06,500 (Stuffer thread) - Stuffer thread: Found 0
>>> documents to queue
>>> DEBUG 2016-07-19 10:31:06,750 (Document cleanup stuffer thread) -
>>> Document cleanup stuffer thread woke up
>>> DEBUG 2016-07-19 10:31:06,750 (Document delete stuffer thread) -
>>> Document delete stuffer thread woke up
>>> DEBUG 2016-07-19 10:31:06,750 (Document cleanup stuffer thread) -
>>> Document cleanup stuffer thread found nothing to do
>>> DEBUG 2016-07-19 10:31:06,750 (Document delete stuffer thread) -
>>> Document delete stuffer thread found nothing to do
>>> DEBUG 2016-07-19 10:31:07,375 (Set priority thread) - Done
>>> reprioritizing because no more documents to reprioritize
>>> DEBUG 2016-07-19 10:31:07,750 (Document cleanup stuffer thread) -
>>> Document cleanup stuffer thread woke up
>>> DEBUG 2016-07-19 10:31:07,750 (Document delete stuffer thread) -
>>> Document delete stuffer thread woke up
>>> DEBUG 2016-07-19 10:31:07,750 (Document delete stuffer thread) -
>>> Document delete stuffer thread found nothing to do
>>> DEBUG 2016-07-19 10:31:07,750 (Document cleanup stuffer thread) -
>>> Document cleanup stuffer thread found nothing to do
>>> DEBUG 2016-07-19 10:31:08,500 (Stuffer thread) - Document stuffer thread
>>> woke up
>>> DEBUG 2016-07-19 10:31:08,516 (Stuffer thread) - Stuffer thread: Found 2
>>> documents to queue
>>> DEBUG 2016-07-19 10:31:08,531 (Stuffer thread) - Document stuffer thread
>>> woke up
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker thread
>>> processing documents: 1453999191642
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker thread
>>> processing documents: 1453999188326
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker thread
>>> starting document count is 1
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker thread
>>> starting document count is 1
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Post-relationship
>>> document count is 1
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Post-relationship
>>> document count is 1
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') -  Post-hopcount
>>> pruned document count is 1
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') -  Post-hopcount
>>> pruned document count is 1
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - Worker thread about
>>> to process {1453999191642 }
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '57') - Worker thread about
>>> to process {1453999188326 }
>>> DEBUG 2016-07-19 10:31:08,531 (Worker thread '62') - JCIFS: Processing
>>> 'smb://...
>>>
>>>
>>> As one can see, there is a delay between the thread-61 has left the
>>> checks and next 2 threads have started.
>>> Any idea why that happens? Why the document stuffer thread does not wake
>>> up immediately? Or - why the "document queue batch size" is only 2?
>>>
>>> P.S. MCF version 2.3
>>>
>>> Thanks!
>>>
>>> --
>>> Konstantin
>>>
>>
>>
>

Mime
View raw message