manifoldcf-user mailing list archives

From Gaurav G <goyalgaur...@gmail.com>
Subject Re: Sharepoint Job - Incremental Crawling
Date Fri, 08 Feb 2019 21:06:11 GMT
Got it.
Is there any way we can increase the speed of the minimal crawl? Currently
we are running one VM for ManifoldCF with 8 cores and 32 GB of RAM. Postgres
runs on another machine with a similar configuration. We have tuned the
Postgres and ManifoldCF parameters as per the recommendations, and we run a
full vacuum once daily.
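
For reference, the daily full vacuum is scheduled roughly like this (the
database name, host, and user below are placeholders for our setup, not
exact values):

```shell
# Crontab entry: nightly at 2 AM, full vacuum + analyze of the ManifoldCF
# database. "manifoldcf", "pg-host", and "mcfuser" are placeholders.
0 2 * * * vacuumdb --full --analyze --host pg-host --username mcfuser manifoldcf
```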

Would switching to a multi-process configuration, with ManifoldCF running on
two servers, give a boost?

Thanks,
Gaurav

On Saturday, February 9, 2019, Karl Wright <daddywri@gmail.com> wrote:

> It does the minimum necessary.  That means it can't do it in less time.  If
> this is a business requirement, then you should be angry with whoever made
> this requirement.
>
> SharePoint doesn't give you the ability to grab all changes or added
> documents up front.  You have to crawl to discover them.  That is how it
> is built, and MCF cannot change it.
>
> Karl
>
> On Fri, Feb 8, 2019, 2:14 PM Gaurav G <goyalgauravg@gmail.com> wrote:
>
>> Hi Karl,
>>
>> Thanks for the response. We tried scheduling a minimal crawl every 15
>> minutes. At the end of fifteen minutes it stops with about 3000 docs in
>> the processing state, and takes about 20-25 minutes to stop. The question
>> then becomes when to schedule the next crawl. Also, in those 15 minutes,
>> would it have picked up all the adds and updates first, or could they be
>> part of the 3000 docs still in the processing state, which would get
>> picked up in the next run? The number of docs that actually change in a
>> 30-minute period won't be more than 200.
>>
>> Being able to capture adds and updates in 30 minutes is a key business
>> requirement.
>>
>> Thanks,
>> Gaurav
>>
>> On Friday, February 8, 2019, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Gaurav,
>>>
>>> The right way to do this is to schedule "minimal" crawls every 15
>>> minutes (which will process only the minimum needed to deal with adds and
>>> updates), and periodically perform "full" crawls (which will also include
>>> deletions).
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Feb 8, 2019 at 10:11 AM Gaurav G <goyalgauravg@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We're trying to crawl a SharePoint repo with about 30000 docs. Ideally
>>>> we would like to be able to synchronize changes with the repo within 30
>>>> minutes, so we are scheduling incremental crawling on this. Our
>>>> observation is that a full crawl takes about 60-75 minutes. So if we
>>>> schedule the incremental crawl every 30 minutes, in what order would it
>>>> process the changes? Would it first bring in the adds and updates and
>>>> then process the rest of the docs? What kind of logic is there in the
>>>> incremental crawl? We also tried a continuous crawl to achieve this;
>>>> however, the continuous crawl was not picking up new documents.
>>>>
>>>> Thanks,
>>>> Gaurav
>>>>
>>>
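
Karl's suggestion of scheduled "minimal" crawls plus periodic "full" crawls
can also be driven from outside ManifoldCF via its JSON REST API. A rough
sketch, assuming the quick-start API service on its default port; the host,
port, and job id below are placeholders for an actual deployment:

```shell
#!/bin/sh
# Sketch: kick off a "minimal" (adds/updates only) run of a ManifoldCF job
# via the JSON API. API base URL and JOB_ID are placeholders.
API="http://localhost:8345/mcf-api-service/json"
JOB_ID="1234567890"

# Start a minimal run of the job.
curl -s -X PUT "$API/startminimal/$JOB_ID"

# Check the job's status afterwards.
curl -s "$API/jobstatuses/$JOB_ID"
```

Run from cron every 15 minutes, this would approximate the schedule
discussed above without using the ManifoldCF UI scheduler.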
