manifoldcf-user mailing list archives

From Jorge Alonso Garcia <jalon...@gmail.com>
Subject Re: sharepoint crawler documents limit
Date Mon, 27 Jan 2020 09:04:53 GMT
Hi,
We changed the timeout on the SharePoint IIS and now the process is able to
crawl all documents.
Thanks for your help



On Mon, Dec 30, 2019 at 12:18 PM Gaurav G (<goyalgauravg@gmail.com>)
wrote:

> We had faced a similar issue, wherein our repo had 100,000 documents but
> our crawler stopped after 50,000 documents. The issue turned out to be
> that the SharePoint query fired by the SharePoint web service gets
> progressively slower, and eventually the connection starts timing out
> before the next 10,000 records are returned. We increased a timeout
> parameter on SharePoint to 10 minutes, and after that we were able to
> crawl all documents successfully.  I believe we had increased the
> parameter indicated in the link below
>
>
> https://weblogs.asp.net/jeffwids/how-to-increase-the-timeout-for-a-sharepoint-2010-website
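[Editor's note: the failure mode described above can be sketched with a toy model. This is not SharePoint or ManifoldCF code; the class, the cost function, and all numbers (page size, document count, timeout budget) are hypothetical, chosen only to mirror the anecdote of a crawl stalling near 50,000 documents. The point is that offset-based paging re-scans all rows before the offset, so each successive page request costs more, until one exceeds a fixed timeout.]

```java
// Hypothetical model (not ManifoldCF/SharePoint code): with offset-based
// paging, the server effectively scans (offset + pageSize) rows per call,
// so later pages take longer and can eventually exceed a fixed timeout.
public class OffsetPaginationCost {
    // Simulated cost of one page fetch, proportional to rows scanned.
    static int pageCost(int startRow, int rowLimit) {
        return startRow + rowLimit;
    }

    public static void main(String[] args) {
        int pageSize = 10000;       // assumed chunk size
        int totalDocs = 100000;     // assumed repository size
        int timeoutBudget = 55000;  // assumed per-request cost budget

        for (int start = 0; start < totalDocs; start += pageSize) {
            if (pageCost(start, pageSize) > timeoutBudget) {
                // The crawl stalls here even though more documents exist.
                System.out.println("request starting at row " + start
                        + " would exceed the timeout");
                return;
            }
        }
        System.out.println("all pages fetched within the timeout");
    }
}
// → request starting at row 50000 would exceed the timeout
```

Under these assumed numbers the sixth page request is the first to blow the budget, which is consistent with a crawl that stops around 50,000 documents; raising the server-side timeout (as in the linked article) lets the later, slower pages complete.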
>
>
>
> On Fri, Dec 20, 2019 at 6:27 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Priya,
>>
>> This has nothing to do with anything in ManifoldCF.
>>
>> Karl
>>
>>
>> On Fri, Dec 20, 2019 at 7:56 AM Priya Arora <priya@smartshore.nl> wrote:
>>
>>> Hi All,
>>>
>>> Is this issue related to the values/parameters set below in
>>> properties.xml?
>>> [image: image.png]
>>>
>>>
>>> On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia <jalongar@gmail.com>
>>> wrote:
>>>
>>>> And what other sharepoint parameter I could check?
>>>>
>>>> Jorge Alonso Garcia
>>>>
>>>>
>>>>
>>>> On Fri, Dec 20, 2019 at 12:47 PM Karl Wright (<daddywri@gmail.com>)
>>>> wrote:
>>>>
>>>>> The code seems correct and many people are using it without
>>>>> encountering this problem.  There may be another SharePoint configuration
>>>>> parameter you also need to look at somewhere.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <
>>>>> jalongar@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi Karl,
>>>>>> On SharePoint the list view threshold is 150,000, but we only
>>>>>> receive 20,000 from MCF
>>>>>> [image: image.png]
>>>>>>
>>>>>>
>>>>>> Jorge Alonso Garcia
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 19, 2019 at 7:19 PM Karl Wright (<daddywri@gmail.com>)
>>>>>> wrote:
>>>>>>
>>>>>>> If the job finished without error, it implies that the number of
>>>>>>> documents returned from this one library was 10000 when the service
>>>>>>> is called the first time (starting at doc 0), 10000 when it's called
>>>>>>> the second time (starting at doc 10000), and zero when it is called
>>>>>>> the third time (starting at doc 20000).
>>>>>>>
>>>>>>> The plugin code is unremarkable and actually gets results in chunks
>>>>>>> of 1000 under the covers:
>>>>>>>
>>>>>>> >>>>>>
>>>>>>> SPQuery listQuery = new SPQuery();
>>>>>>> listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
>>>>>>> listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
>>>>>>> listQuery.ViewAttributes = "Scope=\"Recursive\"";
>>>>>>> listQuery.ViewFields = "<FieldRef Name='FileRef' />";
>>>>>>> listQuery.RowLimit = 1000;
>>>>>>>
>>>>>>> XmlDocument doc = new XmlDocument();
>>>>>>> retVal = doc.CreateElement("GetListItems",
>>>>>>>     "http://schemas.microsoft.com/sharepoint/soap/directory/");
>>>>>>> XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");
>>>>>>>
>>>>>>> uint counter = 0;
>>>>>>> do
>>>>>>> {
>>>>>>>     if (counter >= startRowParam + rowLimitParam)
>>>>>>>         break;
>>>>>>>
>>>>>>>     SPListItemCollection collListItems = oList.GetItems(listQuery);
>>>>>>>
>>>>>>>     foreach (SPListItem oListItem in collListItems)
>>>>>>>     {
>>>>>>>         if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
>>>>>>>         {
>>>>>>>             XmlNode resultNode = doc.CreateElement("GetListItemsResult");
>>>>>>>             XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
>>>>>>>             idAttribute.Value = oListItem.Url;
>>>>>>>             resultNode.Attributes.Append(idAttribute);
>>>>>>>             XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
>>>>>>>             //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
>>>>>>>             urlAttribute.Value = string.Format("{0}?ID={1}",
>>>>>>>                 oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
>>>>>>>                 oListItem.ID);
>>>>>>>             resultNode.Attributes.Append(urlAttribute);
>>>>>>>             getListItemsNode.AppendChild(resultNode);
>>>>>>>         }
>>>>>>>         counter++;
>>>>>>>     }
>>>>>>>
>>>>>>>     listQuery.ListItemCollectionPosition = collListItems.ListItemCollectionPosition;
>>>>>>>
>>>>>>> } while (listQuery.ListItemCollectionPosition != null);
>>>>>>>
>>>>>>> retVal.AppendChild(getListItemsNode);
>>>>>>> <<<<<<
>>>>>>>
>>>>>>> The code is clearly working if you get 20000 results returned, so I
>>>>>>> submit that perhaps there's a configured limit in your SharePoint
>>>>>>> instance that prevents listing more than 20000.  That's the only way
>>>>>>> I can explain this.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <
>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> The job finishes OK (several times) but always with these 20000
>>>>>>>> documents; for some reason the loop only executes twice.
>>>>>>>>
>>>>>>>> Jorge Alonso Garcia
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 19, 2019 at 6:14 PM Karl Wright (<daddywri@gmail.com>)
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> If they are all in one library, then you'd be running this code:
>>>>>>>>>
>>>>>>>>> >>>>>>
>>>>>>>>>         int startingIndex = 0;
>>>>>>>>>         int amtToRequest = 10000;
>>>>>>>>>         while (true)
>>>>>>>>>         {
>>>>>>>>>           com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>>>>>>>>>             itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));
>>>>>>>>>
>>>>>>>>>           MessageElement[] itemsList = itemsResult.get_any();
>>>>>>>>>
>>>>>>>>>           if (Logging.connectors.isDebugEnabled()){
>>>>>>>>>             Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           if (itemsList.length != 1)
>>>>>>>>>             throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));
>>>>>>>>>
>>>>>>>>>           MessageElement items = itemsList[0];
>>>>>>>>>           if (!items.getElementName().getLocalName().equals("GetListItems"))
>>>>>>>>>             throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>>>>>>>>>
>>>>>>>>>           int resultCount = 0;
>>>>>>>>>           Iterator iter = items.getChildElements();
>>>>>>>>>           while (iter.hasNext())
>>>>>>>>>           {
>>>>>>>>>             MessageElement child = (MessageElement)iter.next();
>>>>>>>>>             if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>>>>>>>>>             {
>>>>>>>>>               Iterator resultIter = child.getChildElements();
>>>>>>>>>               while (resultIter.hasNext())
>>>>>>>>>               {
>>>>>>>>>                 MessageElement result = (MessageElement)resultIter.next();
>>>>>>>>>                 if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>>>>>>>>>                 {
>>>>>>>>>                   resultCount++;
>>>>>>>>>                   String relPath = result.getAttribute("FileRef");
>>>>>>>>>                   String displayURL = result.getAttribute("ListItemURL");
>>>>>>>>>                   fileStream.addFile( relPath, displayURL );
>>>>>>>>>                 }
>>>>>>>>>               }
>>>>>>>>>             }
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           if (resultCount < amtToRequest)
>>>>>>>>>             break;
>>>>>>>>>
>>>>>>>>>           startingIndex += resultCount;
>>>>>>>>>         }
>>>>>>>>> <<<<<<
>>>>>>>>>
>>>>>>>>> What this does is request library content URLs in chunks of
>>>>>>>>> 10000.  It stops when it receives less than 10000 documents from
>>>>>>>>> any one request.
>>>>>>>>>
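[Editor's note: the loop's stop condition can be isolated in a short sketch. This is a hypothetical simplification, not the actual ManifoldCF code; `ChunkedFetch`, `fetchAll`, and the fake capped server are invented for illustration. It shows why a server that silently stops returning rows makes the crawl end cleanly, with no error, at a round number of documents.]

```java
// Hypothetical simplification of the chunked-fetch loop (not ManifoldCF
// code): request fixed-size pages and stop at the first short page.  If
// the server silently returns an empty page past some cutoff, the loop
// exits normally, matching the "stopped at 20000 without error" symptom.
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class ChunkedFetch {
    public static List<String> fetchAll(
            BiFunction<Integer, Integer, List<String>> getPage, int pageSize) {
        List<String> all = new ArrayList<>();
        int start = 0;
        while (true) {
            List<String> page = getPage.apply(start, pageSize);
            all.addAll(page);
            if (page.size() < pageSize)  // short page => assume no more docs
                break;
            start += page.size();
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake server holding 25000 docs but returning nothing past row
        // 20000, mimicking a silent server-side cutoff.
        BiFunction<Integer, Integer, List<String>> cappedServer = (start, limit) -> {
            List<String> page = new ArrayList<>();
            for (int i = start; i < Math.min(start + limit, 25000) && i < 20000; i++)
                page.add("doc" + i);
            return page;
        };
        System.out.println(fetchAll(cappedServer, 10000).size()); // prints 20000
    }
}
```

With this model the third call returns zero rows, the loop breaks, and exactly 20000 documents are reported with no error raised, which is why a server-side limit or timeout rather than a connector bug fits the observations in this thread.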
>>>>>>>>> If the documents were all in one library, then one call to the
>>>>>>>>> web service yielded 10000 documents, the second call yielded 10000
>>>>>>>>> documents, and there was no third call, for no reason I can figure
>>>>>>>>> out.  Since 10000 documents were returned each time, the loop
>>>>>>>>> ought to just continue, unless there was some kind of error.  Does
>>>>>>>>> the job succeed, or does it abort?
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If you are using the MCF plugin, and selecting the appropriate
>>>>>>>>>> version of SharePoint in the connection configuration, there is
>>>>>>>>>> no hard limit I'm aware of for any SharePoint job.  We have lots
>>>>>>>>>> of other people using SharePoint and nobody has reported this
>>>>>>>>>> ever before.
>>>>>>>>>>
>>>>>>>>>> If your SharePoint connection says "SharePoint 2003" as the
>>>>>>>>>> SharePoint version, then sure, that would be expected behavior.
>>>>>>>>>> So please check that first.
>>>>>>>>>>
>>>>>>>>>> The other question I have is your description of your first
>>>>>>>>>> getting 10001 documents and then later 20002.  That's not how
>>>>>>>>>> ManifoldCF works.  At the start of the crawl, seeds are added;
>>>>>>>>>> this would start out just being the root, and then other
>>>>>>>>>> documents would be discovered as the crawl proceeded, after
>>>>>>>>>> subsites and libraries are discovered.  So I am still trying to
>>>>>>>>>> square that with your description of how this is working for you.
>>>>>>>>>>
>>>>>>>>>> Are all of your documents in one library?  Or two libraries?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <
>>>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> The UI shows 20,002 documents (in a first phase it showed
>>>>>>>>>>> 10,001, and after some time of processing it rose to 20,002).
>>>>>>>>>>> It looks like a hard limit; there are more files on SharePoint
>>>>>>>>>>> matching the used criteria.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 19, 2019 at 4:05 PM Karl Wright (<
>>>>>>>>>>> daddywri@gmail.com>) wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Jorge,
>>>>>>>>>>>>
>>>>>>>>>>>> When you run the job, do you see more than 20,000 documents as
>>>>>>>>>>>> part of it?
>>>>>>>>>>>>
>>>>>>>>>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>>>>>>>>>
>>>>>>>>>>>> Unless you are seeing a hard number like that in the UI for
>>>>>>>>>>>> that job on the job status page, I doubt very much that the
>>>>>>>>>>>> problem is a numerical limitation in the number of documents.
>>>>>>>>>>>> I would suspect that the inclusion criteria, e.g. the mime type
>>>>>>>>>>>> or maximum length, is excluding documents.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <
>>>>>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>> We have installed the SharePoint plugin, and can access
>>>>>>>>>>>>> http://server/_vti_bin/MCPermissions.asmx properly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>
>>>>>>>>>>>>> SharePoint has more than 20,000 documents, but when the job
>>>>>>>>>>>>> executes it only extracts these 20,000.  How can I check where
>>>>>>>>>>>>> the issue is?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 19, 2019 at 12:52 PM Karl Wright (<
>>>>>>>>>>>>> daddywri@gmail.com>) wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> By "stop at 20,000" do you mean that it finds more than
>>>>>>>>>>>>>> 20,000 but stops crawling at that time?  Or what exactly do
>>>>>>>>>>>>>> you mean here?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW, the behavior you describe sounds like you may not have
>>>>>>>>>>>>>> installed the SharePoint plugin and may have selected a
>>>>>>>>>>>>>> version of SharePoint that is inappropriate.  All SharePoint
>>>>>>>>>>>>>> versions after 2008 limit the number of documents returned
>>>>>>>>>>>>>> using the standard web services methods.  The plugin allows
>>>>>>>>>>>>>> us to bypass that hard limit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <
>>>>>>>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> We have an issue with the SharePoint connector.
>>>>>>>>>>>>>>> There is a job that crawls a SharePoint 2016, but it is not
>>>>>>>>>>>>>>> recovering all files; it stops at 20,000 documents without
>>>>>>>>>>>>>>> any error.
>>>>>>>>>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>>>>>>>>>> limitation?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
