manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Sharepoint Job - Incremental Crawling
Date Fri, 08 Feb 2019 21:33:16 GMT
The problem is not the speed of ManifoldCF, but rather the amount of work it
has to do and the performance of SharePoint.  All the speed in the world in
the crawler will not fix the bottleneck that is SharePoint.

Karl


On Fri, Feb 8, 2019 at 4:06 PM Gaurav G <goyalgauravg@gmail.com> wrote:

> Got it.
> Is there any way we can increase the speed of the minimal crawl? Currently
> we are running one VM for ManifoldCF with 8 cores and 32 GB of RAM. Postgres
> runs on another machine with a similar configuration. We have tuned the
> Postgres and ManifoldCF parameters per the recommendations, and we run a
> full vacuum once daily.
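>
> For reference, our daily vacuum is a cron job along these lines (a minimal
> sketch; the database name "manifoldcf" and the 02:00 slot are illustrative):
>
>   # Full vacuum plus analyze of the MCF database every night at 02:00
>   0 2 * * * vacuumdb --full --analyze --dbname=manifoldcf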
>
> Would switching to a multi-process configuration, with ManifoldCF running
> on two servers, give a boost? A rough sketch of what we have in mind is
> below.
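>
> For context, this is roughly the multiprocess file-based layout that ships
> with ManifoldCF (a sketch only; it assumes the multiprocess-file-example
> directory from the binary distribution and a synch directory on a share
> visible to both servers):
>
>   # On each crawler server, after pointing the synch directory at the share:
>   cd multiprocess-file-example
>   ./start-agents.sh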
>
> Thanks,
> Gaurav
>
> On Saturday, February 9, 2019, Karl Wright <daddywri@gmail.com> wrote:
>
>> It does the minimum necessary; it can't do it in less. If this is a
>> business requirement, then you should be angry with whoever set it.
>>
>> SharePoint doesn't give you the ability to grab all changed or added
>> documents up front. You have to crawl to discover them. That is how it is
>> built, and MCF cannot change it.
>>
>> Karl
>>
>> On Fri, Feb 8, 2019, 2:14 PM Gaurav G <goyalgauravg@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> Thanks for the response. We tried scheduling a minimal crawl every 15
>>> minutes. At the end of the 15 minutes, about 3000 docs are still in the
>>> processing state, and the job takes another 20-25 minutes to stop. That
>>> raises the question of when to schedule the next crawl. Also, in those 15
>>> minutes, would the crawl have picked up all the adds and updates first, or
>>> could some of them be among the 3000 docs still in processing, to be
>>> picked up only in the next run? The number of docs that actually change in
>>> a 30-minute period won't be more than 200.
>>>
>>> Being able to capture adds and updates within 30 minutes is a key business
>>> requirement.
>>>
>>> Thanks,
>>> Gaurav
>>>
>>> On Friday, February 8, 2019, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Gaurav,
>>>>
>>>> The right way to do this is to schedule "minimal" crawls every 15
>>>> minutes (which will process only the minimum needed to deal with adds and
>>>> updates), and periodically perform "full" crawls (which will also include
>>>> deletions).
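>>>>
>>>> If you script this rather than using the scheduler, the two crawl types
>>>> map to separate REST API calls (a sketch; the host/port and the job id
>>>> 1234567890 are placeholders):
>>>>
>>>>   # Minimal crawl: adds and updates only
>>>>   curl -X PUT http://localhost:8345/mcf-api-service/json/startminimal/1234567890
>>>>
>>>>   # Full crawl: also detects deletions
>>>>   curl -X PUT http://localhost:8345/mcf-api-service/json/start/1234567890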
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Feb 8, 2019 at 10:11 AM Gaurav G <goyalgauravg@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We're trying to crawl a SharePoint repo with about 30000 docs. Ideally
>>>>> we would like to be able to synchronize changes with the repo within 30
>>>>> minutes. We are scheduling incremental crawling on this. Our observation
>>>>> is that a full crawl takes about 60-75 minutes. So if we schedule the
>>>>> incremental crawl every 30 minutes, in what order would it process the
>>>>> changes? Would it first bring in the adds and updates and then process
>>>>> the rest of the docs? What kind of logic is there in the incremental
>>>>> crawl? We also tried the continuous crawl to achieve this. However, the
>>>>> continuous crawl was somehow not picking up new documents.
>>>>>
>>>>> Thanks,
>>>>> Gaurav
>>>>>
>>>>
