manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: stuffamountfactor and getting more work done
Date Fri, 12 Dec 2014 12:25:42 GMT
FWIW, you can diagnose a slow stuffer query by getting a thread dump.  If
there are tons of idle worker threads AND your stuffer thread is waiting on
PostgreSQL, that's a good sign the stuffer is not keeping up because of the
database.

Karl
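
The thread-dump check described above could be roughed out as follows. This is only a sketch, not part of ManifoldCF: the class `ThreadDumpSummary` is hypothetical, and the thread-name prefixes ("Worker thread", "Stuffer thread") and the stack-frame prefix used to spot database waits are assumptions that should be checked against a real `jstack` dump from your own deployment. `org.postgresql.` is the PostgreSQL JDBC driver package, but the stuffer may show frames from another database layer instead, so the prefix is a parameter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: summarizes a jstack-style thread dump passed in as text.
final class ThreadDumpSummary {

    /** Counts extracted from one thread dump. */
    static final class Summary {
        int workers;               // threads whose name starts with the worker prefix
        int idleWorkers;           // workers in WAITING or TIMED_WAITING state
        boolean stufferInDatabase; // stuffer stack contains a frame with the given prefix
    }

    static Summary summarize(String dump, String workerPrefix,
                             String stufferPrefix, String dbFramePrefix) {
        Summary s = new Summary();
        // jstack separates threads with a blank line; each block starts
        // with the thread name in double quotes.
        for (String block : dump.split("\\R\\s*\\R")) {
            String b = block.trim();
            if (!b.startsWith("\"")) {
                continue; // header or footer text, not a thread block
            }
            int end = b.indexOf('"', 1);
            if (end < 0) {
                continue;
            }
            String name = b.substring(1, end);
            if (name.startsWith(workerPrefix)) {
                s.workers++;
                // Treat waiting states as "idle"; this is an approximation.
                if (b.contains("java.lang.Thread.State: WAITING")
                        || b.contains("java.lang.Thread.State: TIMED_WAITING")) {
                    s.idleWorkers++;
                }
            } else if (name.startsWith(stufferPrefix)) {
                if (b.contains("at " + dbFramePrefix)) {
                    s.stufferInDatabase = true;
                }
            }
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ThreadDumpSummary <file containing jstack output>
        String dump = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.UTF_8);
        // Name and frame prefixes below are assumptions; adjust to your dump.
        Summary s = summarize(dump, "Worker thread", "Stuffer thread", "org.postgresql.");
        System.out.println("workers=" + s.workers
                + " idleWorkers=" + s.idleWorkers
                + " stufferInDatabase=" + s.stufferInDatabase);
    }
}
```

Many idle workers together with `stufferInDatabase=true` would match the situation described above. Many idle workers with the stuffer not in the database points elsewhere, for example at too few eligible documents.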


On Fri, Dec 12, 2014 at 7:23 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi Aeham,
>
> Before you assume that stuffing is just not happening fast enough, you
> will want to confirm that you have enough documents that are *eligible* for
> processing.  In a continuous job, documents may well be scheduled to be
> crawled at some time in the future, and are ineligible for crawling until
> that future time arrives.  You can get a better sense of this by using the
> document and queue status reports.
>
> If you only have 30 worker threads on your machine, it's extremely
> unlikely that you would find yourself unable to stuff documents fast enough
> with the default parameters.  The only way that would not be true is if
> your stuffer queries are performing badly, and that would be important to
> know too.
>
> Thanks,
> Karl
>
>
>
>
> On Fri, Dec 12, 2014 at 7:11 AM, Aeham Abushwashi <
> aeham.abushwashi@exonar.com> wrote:
>>
>> Hi,
>>
>> Are there any gotchas one should be aware of when configuring property
>> "org.apache.manifoldcf.crawler.stuffamountfactor"?
>>
>> At times, I see the manifold nodes in my cluster (and the postgresql box)
>> not utilising all the resources they have. I have configured 30 worker
>> threads which tend to sit idle waiting for documents (continuous crawl).
>> This led me to tweak the batch size of the Stuffer thread indirectly using
>> "org.apache.manifoldcf.crawler.stuffamountfactor" and setting it to 20 (I
>> believe the default is 2).
>>
>> I understand that increasing the batch size results in a bigger result
>> set coming back from the database. If the size is in the 1000s I doubt it
>> would cause problems. My hope is a bigger stuffer batch would allow worker
>> threads to operate more efficiently and handle more documents where
>> possible.
>>
>> Please let me know if there are any particular concerns/guidelines over
>> tweaking this config property or if there are better ways for increasing
>> the width of the processing pipeline for each manifold instance.
>>
>> Thanks,
>> Aeham
>>
>
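
For a rough sense of the batch sizes discussed in the original question, here is a small arithmetic sketch. It assumes that the stuffer batch size is the number of worker threads multiplied by "org.apache.manifoldcf.crawler.stuffamountfactor"; that relationship is an assumption here and should be confirmed against the ManifoldCF documentation for your version. The class and method names are hypothetical.

```java
// Hypothetical helper, not ManifoldCF code.
final class StufferBatch {

    // ASSUMPTION: batch size = worker threads * stuffamountfactor.
    static int batchSize(int workerThreads, double stuffAmountFactor) {
        return (int) (workerThreads * stuffAmountFactor);
    }

    public static void main(String[] args) {
        // 30 worker threads, as in the question.
        System.out.println("factor 2:  " + batchSize(30, 2.0));
        System.out.println("factor 20: " + batchSize(30, 20.0));
    }
}
```

Under that assumption, 30 worker threads give a batch of 60 documents at a factor of 2 and 600 at a factor of 20, which is below the "1000s" range mentioned in the question.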
