manifoldcf-user mailing list archives

From Steph van Schalkwyk <st...@remcam.net>
Subject Re: Sharepoint Job - Incremental Crawling
Date Sat, 09 Feb 2019 13:08:33 GMT
Hi. I just saw this thread.
I believe Microsoft recommends a dedicated document source instance for larger
corpora.
I know in my SP days we often frustrated users by making SP very slow
while we were crawling, which was mostly solved by having a dedicated
source node.
S

On Sat, Feb 9, 2019, 2:10 AM Karl Wright <daddywri@gmail.com> wrote:

> Hi Gaurav,
>
> The number of connections you permit should depend on the resources on the
> SharePoint instance you're crawling.  ManifoldCF will limit the number of
> connections to that instance to the number you select.  Making it larger
> might help if there are a lot of resources on the SharePoint side, but in my
> experience that's usually not realistic, and just increasing the connection
> count can even have a paradoxical effect.  So that will require a back and
> forth with the people running the SharePoint instances.
>
> Once you can confirm that SharePoint is no longer the bottleneck (I'm
> pretty certain it is right now), then the next step would be database
> performance optimization.  For Postgres running on Linux, you should be
> pretty much pegging the CPUs on the DB machine if you've got all the other
> bottlenecks eliminated.  If you aren't pegging those CPUs and/or the
> machine is IO bound, there has to be another bottleneck somewhere and
> you'll need to find it.
>
> Karl
>
>
> On Sat, Feb 9, 2019 at 1:10 AM Gaurav G <goyalgauravg@gmail.com> wrote:
>
>> Hi Karl,
>>
>> Thanks for your insights. I'm thinking of exploring the following
>> options to get the best performance. What are your thoughts? Is the first
>> option the one that might give the most bang for the buck?
>>
>> 1) Ask the SharePoint application team to dedicate a web and app server
>> specifically for crawling. On a related point, is there an optimal
>> value for the number of concurrent repository connections? Currently we
>> have it at about 40, and I'm not sure if increasing it further will
>> improve speeds.
>> 2) Split the crawling between two sets of ManifoldCF and Postgres
>> servers running on 4 different VMs, but with a smaller configuration,
>> say 4 cores and 12 GB RAM.
>> 3) Co-locate the crawlers in the same data center as the SharePoint
>> servers. Currently they are in different DCs with dedicated MPLS
>> connectivity.
>>
>> Thanks,
>> Gaurav
>>
>> On Sat, Feb 9, 2019 at 3:03 AM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> The problem is not the speed of Manifold, but rather the work it has to
>>> do and the performance of SharePoint.  All the speed in the world in the
>>> crawler will not fix the bottleneck that is SharePoint.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Feb 8, 2019 at 4:06 PM Gaurav G <goyalgauravg@gmail.com> wrote:
>>>
>>>> Got it.
>>>> Is there any way we can increase the speed of the minimal crawl?
>>>> Currently we are running one VM for ManifoldCF with 8 cores and 32 GB RAM.
>>>> Postgres runs on another machine with a similar configuration. We have
>>>> tuned the Postgres and ManifoldCF parameters as per the recommendations.
>>>> We run a full vacuum once daily.
>>>>
>>>> Would switching to a multi-process configuration, with ManifoldCF
>>>> running on two servers, give a boost?
>>>>
>>>> Thanks,
>>>> Gaurav
>>>>
>>>> On Saturday, February 9, 2019, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> It does the minimum necessary.  That means it can't do it in less.  If
>>>>> this is a business requirement, then you should be angry with whoever
>>>>> made this requirement.
>>>>>
>>>>> SharePoint doesn't give you the ability to grab all changed or added
>>>>> documents up front.  You have to crawl to discover them.  That is how
>>>>> it is built, and MCF cannot change it.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Fri, Feb 8, 2019, 2:14 PM Gaurav G <goyalgauravg@gmail.com> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> Thanks for the response. We tried scheduling a minimal crawl for 15
>>>>>> minutes. At the end of fifteen minutes it stops with about 3000 docs
>>>>>> in the processing state and takes about 20-25 minutes to stop. The
>>>>>> question then becomes when to schedule the next crawl. Also, in those
>>>>>> 15 minutes, would it have picked up all the adds and updates first, or
>>>>>> could they be among the 3000 docs still in the processing state, which
>>>>>> would get picked up in the next run? The number of docs that actually
>>>>>> change in a 30-minute period won't be more than 200.
>>>>>>
>>>>>> Being able to capture adds and updates in 30 minutes is a key
>>>>>> business requirement.
>>>>>>
>>>>>> Thanks,
>>>>>> Gaurav
>>>>>>
>>>>>> On Friday, February 8, 2019, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Gaurav,
>>>>>>>
>>>>>>> The right way to do this is to schedule "minimal" crawls every 15
>>>>>>> minutes (which will process only the minimum needed to deal with adds
>>>>>>> and updates), and periodically perform "full" crawls (which will also
>>>>>>> include deletions).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 8, 2019 at 10:11 AM Gaurav G <goyalgauravg@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We're trying to crawl a SharePoint repo with about 30000 docs.
>>>>>>>> Ideally we would like to be able to synchronize changes with the repo
>>>>>>>> within 30 minutes. We are scheduling incremental crawling on this. Our
>>>>>>>> observation is that a full crawl takes about 60-75 minutes. So if we
>>>>>>>> schedule the incremental crawl for 30 minutes, in what order would it
>>>>>>>> process the changes? Would it first bring in the adds and updates and
>>>>>>>> then process the rest of the docs? What kind of logic is there in the
>>>>>>>> incremental crawl?
>>>>>>>> We also tried the continuous crawl to achieve this. However, somehow
>>>>>>>> the continuous crawl was not picking up new documents.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gaurav
>>>>>>>>
>>>>>>>
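
The timing figures quoted in this thread can be checked with a short calculation. The following is only an illustrative sketch in Python, not anything ManifoldCF provides: the function names are hypothetical, and it assumes that the next scheduled crawl cannot begin until the previous one has fully stopped.

```python
# Illustrative arithmetic only; the numbers come from the thread,
# the function names and the "no overlap" assumption do not.

def cycle_minutes(window_min, stop_min):
    """Minutes from the start of one scheduled crawl until the next can start,
    assuming a new crawl must wait for the previous one to finish stopping."""
    return window_min + stop_min


def meets_target(window_min, stop_min, target_min):
    """True if one full crawl cycle fits inside the synchronization target."""
    return cycle_minutes(window_min, stop_min) <= target_min


WINDOW_MIN = 15          # scheduled minimal-crawl window
STOP_RANGE_MIN = (20, 25)  # observed time for the job to stop
TARGET_MIN = 30          # business requirement for picking up changes

best_case = cycle_minutes(WINDOW_MIN, STOP_RANGE_MIN[0])
worst_case = cycle_minutes(WINDOW_MIN, STOP_RANGE_MIN[1])

print(f"cycle length: {best_case}-{worst_case} minutes")
print("fits target:", meets_target(WINDOW_MIN, STOP_RANGE_MIN[1], TARGET_MIN))
```

Under that assumption, a 15-minute window plus a 20-25 minute stop gives a 35-40 minute cycle, which is longer than the 30-minute target.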
