manifoldcf-user mailing list archives

From Jorge Alonso Garcia <jalon...@gmail.com>
Subject Re: sharepoint crawler documents limit
Date Fri, 20 Dec 2019 11:38:01 GMT
Hi Karl,
On SharePoint the list view threshold is 150,000, but we only receive 20,000
from MCF.


Jorge Alonso Garcia



On Thu, Dec 19, 2019 at 7:19 PM Karl Wright (<daddywri@gmail.com>) wrote:

> If the job finished without error, it implies that the number of documents
> returned from this one library was 10000 when the service was called the
> first time (starting at doc 0), 10000 when it was called the second time
> (starting at doc 10000), and zero when it was called the third time
> (starting at doc 20000).
>
> The plugin code is unremarkable and actually gets results in chunks of
> 1000 under the covers:
>
> >>>>>>
> SPQuery listQuery = new SPQuery();
> // Order results by FileRef and ask SharePoint to ignore the list view throttle
> listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
> listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
> listQuery.ViewAttributes = "Scope=\"Recursive\"";
> listQuery.ViewFields = "<FieldRef Name='FileRef' />";
> listQuery.RowLimit = 1000;
>
> XmlDocument doc = new XmlDocument();
> retVal = doc.CreateElement("GetListItems",
>     "http://schemas.microsoft.com/sharepoint/soap/directory/");
> XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");
>
> uint counter = 0;
> do
> {
>     // Stop once we've walked past the end of the requested window
>     if (counter >= startRowParam + rowLimitParam)
>         break;
>
>     SPListItemCollection collListItems = oList.GetItems(listQuery);
>
>     foreach (SPListItem oListItem in collListItems)
>     {
>         // Emit only items inside [startRowParam, startRowParam + rowLimitParam)
>         if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
>         {
>             XmlNode resultNode = doc.CreateElement("GetListItemsResult");
>             XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
>             idAttribute.Value = oListItem.Url;
>             resultNode.Attributes.Append(idAttribute);
>             XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
>             //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
>             urlAttribute.Value = string.Format("{0}?ID={1}",
>                 oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
>                 oListItem.ID);
>             resultNode.Attributes.Append(urlAttribute);
>             getListItemsNode.AppendChild(resultNode);
>         }
>         counter++;
>     }
>
>     // Continue from the server-side paging cursor
>     listQuery.ListItemCollectionPosition = collListItems.ListItemCollectionPosition;
>
> } while (listQuery.ListItemCollectionPosition != null);
>
> retVal.AppendChild(getListItemsNode);
> <<<<<<
>
> The code is clearly working if you get 20000 results returned, so I submit
> that perhaps there's a configured limit in your SharePoint instance that
> prevents listing more than 20000.  That's the only way I can explain this.
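One way to rule a configured throttle in or out is to read the web application's throttle settings directly via the SharePoint server object model. This is a minimal sketch, assuming code can be run on the SharePoint server itself; "http://server/" is a placeholder URL:

>>>>>>
// Minimal sketch, assuming server-side access; "http://server/" is a placeholder.
using System;
using Microsoft.SharePoint.Administration;

class ThrottleCheck
{
    static void Main()
    {
        SPWebApplication webApp = SPWebApplication.Lookup(new Uri("http://server/"));
        // The list view threshold applied to ordinary user queries...
        Console.WriteLine("MaxItemsPerThrottledOperation: " + webApp.MaxItemsPerThrottledOperation);
        // ...and the higher ceiling available to auditors and administrators.
        Console.WriteLine("MaxItemsPerThrottledOperationOverride: " + webApp.MaxItemsPerThrottledOperationOverride);
    }
}
<<<<<<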
>
> Karl
>
>
> On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <jalongar@gmail.com>
> wrote:
>
>> Hi,
>> The job finishes OK (several times), but always with these 20,000
>> documents; for some reason the loop only executes twice.
>>
>> Jorge Alonso Garcia
>>
>>
>>
>> On Thu, Dec 19, 2019 at 6:14 PM Karl Wright (<daddywri@gmail.com>) wrote:
>>
>>> If they are all in one library, then you'd be running this code:
>>>
>>> >>>>>>
>>> int startingIndex = 0;
>>> int amtToRequest = 10000;
>>> while (true)
>>> {
>>>   // Ask the plugin web service for the next chunk of item URLs
>>>   com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>>>     itemCall.getListItems(guid, Integer.toString(startingIndex), Integer.toString(amtToRequest));
>>>
>>>   MessageElement[] itemsList = itemsResult.get_any();
>>>
>>>   if (Logging.connectors.isDebugEnabled()) {
>>>     Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>>>   }
>>>
>>>   if (itemsList.length != 1)
>>>     throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw " + Integer.toString(itemsList.length));
>>>
>>>   MessageElement items = itemsList[0];
>>>   if (!items.getElementName().getLocalName().equals("GetListItems"))
>>>     throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>>>
>>>   // Walk GetListItemsResponse/GetListItemsResult nodes, collecting paths and URLs
>>>   int resultCount = 0;
>>>   Iterator iter = items.getChildElements();
>>>   while (iter.hasNext())
>>>   {
>>>     MessageElement child = (MessageElement)iter.next();
>>>     if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>>>     {
>>>       Iterator resultIter = child.getChildElements();
>>>       while (resultIter.hasNext())
>>>       {
>>>         MessageElement result = (MessageElement)resultIter.next();
>>>         if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>>>         {
>>>           resultCount++;
>>>           String relPath = result.getAttribute("FileRef");
>>>           String displayURL = result.getAttribute("ListItemURL");
>>>           fileStream.addFile(relPath, displayURL);
>>>         }
>>>       }
>>>     }
>>>   }
>>>
>>>   // A short chunk means the library is exhausted
>>>   if (resultCount < amtToRequest)
>>>     break;
>>>
>>>   startingIndex += resultCount;
>>> }
>>> <<<<<<
>>>
>>> What this does is request library content URLs in chunks of 10000.  It
>>> stops when it receives fewer than 10000 documents from any one request.
>>>
>>> If the documents were all in one library, then one call to the web
>>> service yielded 10000 documents, the second call yielded 10000 more,
>>> and there was no third call, for reasons I cannot figure out.  Since
>>> 10000 documents were returned each time, the loop ought to just
>>> continue, unless there was some kind of error.  Does the job succeed,
>>> or does it abort?
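To make that arithmetic concrete: if something server-side silently caps what a library will reveal at 20,000 items, the chunked loop above would get two full chunks and then an empty third one, ending cleanly at exactly 20,000. The following is a self-contained simulation of that behavior; the 20,000 cap and all names are hypothetical, purely for illustration:

>>>>>>
using System;

class ChunkLoopSimulation
{
    const int ServerCap = 20000;   // hypothetical server-side ceiling
    const int TotalDocs = 150000;  // documents actually in the library

    // Stand-in for the web service: returns how many items a request yields
    static int GetListItems(int startingIndex, int amtToRequest)
    {
        int visible = Math.Min(TotalDocs, ServerCap); // nothing past the cap is revealed
        return Math.Max(0, Math.Min(visible - startingIndex, amtToRequest));
    }

    static void Main()
    {
        int startingIndex = 0, amtToRequest = 10000, total = 0;
        while (true)
        {
            int resultCount = GetListItems(startingIndex, amtToRequest);
            Console.WriteLine($"call at {startingIndex}: {resultCount} items");
            total += resultCount;
            if (resultCount < amtToRequest) // a short (or empty) chunk ends the loop
                break;
            startingIndex += resultCount;
        }
        Console.WriteLine($"total: {total}"); // prints 20000: two full chunks, then an empty third
    }
}
<<<<<<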
>>>
>>> Karl
>>>
>>>
>>> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> If you are using the MCF plugin, and selecting the appropriate version
>>>> of SharePoint in the connection configuration, there is no hard limit
>>>> I'm aware of for any SharePoint job.  We have lots of other people
>>>> using SharePoint, and nobody has ever reported this before.
>>>>
>>>> If your SharePoint connection says "SharePoint 2003" as the SharePoint
>>>> version, then sure, that would be expected behavior.  So please check that
>>>> first.
>>>>
>>>> The other question I have is about your description of first getting
>>>> 10001 documents and then later 20002.  That's not how ManifoldCF
>>>> works.  At the start of the crawl, seeds are added; this would start
>>>> out as just the root, and then other documents would be discovered as
>>>> the crawl proceeded, after subsites and libraries are discovered.  So
>>>> I am still trying to square that with your description of how this is
>>>> working for you.
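For intuition only, the discovery pattern Karl describes is roughly a breadth-first expansion from a single seed, which is why the visible document count grows in waves. The sketch below is an illustrative simplification, not ManifoldCF's actual code; every type, method, and the ".docx" test in it is hypothetical:

>>>>>>
using System.Collections.Generic;

// Illustrative only: a crawl that starts from one seed and discovers
// containers (subsites, libraries) before it ever reaches documents.
class CrawlSketch
{
    static void Main()
    {
        var frontier = new Queue<string>();
        frontier.Enqueue("/");                       // the seed: site root
        var documents = new List<string>();

        while (frontier.Count > 0)
        {
            string node = frontier.Dequeue();
            foreach (string child in Discover(node)) // hypothetical: lists children of a node
            {
                if (child.EndsWith(".docx"))         // hypothetical "is a document" test
                    documents.Add(child);
                else
                    frontier.Enqueue(child);         // containers go back on the frontier
            }
        }
    }

    static IEnumerable<string> Discover(string node)
    {
        yield break; // placeholder; a real connector would query SharePoint here
    }
}
<<<<<<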
>>>>
>>>> Are all of your documents in one library?  Or two libraries?
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <
>>>> jalongar@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> The UI shows 20,002 documents (in a first phase it showed 10,001, and
>>>>> after some more processing it rose to 20,002).
>>>>> It looks like a hard limit; there are more files on SharePoint that
>>>>> match the criteria used.
>>>>>
>>>>>
>>>>> Jorge Alonso Garcia
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 19, 2019 at 4:05 PM Karl Wright (<daddywri@gmail.com>) wrote:
>>>>>
>>>>>> Hi Jorge,
>>>>>>
>>>>>> When you run the job, do you see more than 20,000 documents as part
>>>>>> of it?
>>>>>>
>>>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>>>
>>>>>> Unless you are seeing a hard number like that in the UI for that job
>>>>>> on the job status page, I doubt very much that the problem is a
>>>>>> numerical limitation in the number of documents.  I would suspect
>>>>>> that the inclusion criteria, e.g. the mime type or maximum length,
>>>>>> are excluding documents.
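If inclusion criteria were the culprit, the effect would be a per-document filter applied before indexing rather than a hard stop. Here is a hedged sketch of that kind of check; the MIME list, size cap, and all names are invented for illustration and are not MCF's actual configuration:

>>>>>>
using System;

// Illustrative per-document inclusion check; the limits and MIME list
// are hypothetical examples, not values from any actual MCF job.
class InclusionCheck
{
    static readonly string[] AllowedMimeTypes = { "application/pdf", "text/plain" };
    const long MaxLengthBytes = 10 * 1024 * 1024; // e.g. a 10 MB cap

    static bool ShouldIndex(string mimeType, long lengthBytes)
    {
        if (Array.IndexOf(AllowedMimeTypes, mimeType) < 0)
            return false; // excluded by MIME type
        if (lengthBytes > MaxLengthBytes)
            return false; // excluded by maximum length
        return true;
    }

    static void Main()
    {
        Console.WriteLine(ShouldIndex("application/pdf", 1024));         // True
        Console.WriteLine(ShouldIndex("video/mp4", 1024));               // False: MIME type
        Console.WriteLine(ShouldIndex("text/plain", 50L * 1024 * 1024)); // False: too large
    }
}
<<<<<<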
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <
>>>>>> jalongar@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>> We have installed the SharePoint plugin, and we can access
>>>>>>> http://server/_vti_bin/MCPermissions.asmx properly.
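For completeness, reachability of the plugin endpoint can also be probed outside a browser. A minimal sketch; the hostname and credentials are placeholders, and SharePoint will typically require NTLM authentication:

>>>>>>
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal endpoint probe; "server" is a placeholder hostname and the
// credentials are assumed, since SharePoint usually requires NTLM auth.
class PluginProbe
{
    static async Task Main()
    {
        var handler = new HttpClientHandler
        {
            Credentials = new NetworkCredential("user", "password", "DOMAIN")
        };
        using var client = new HttpClient(handler);
        HttpResponseMessage response =
            await client.GetAsync("http://server/_vti_bin/MCPermissions.asmx");
        // 200 means the plugin service page is reachable
        Console.WriteLine((int)response.StatusCode + " " + response.ReasonPhrase);
    }
}
<<<<<<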
>>>>>>>
>>>>>>>
>>>>>>> SharePoint has more than 20,000 documents, but when the job
>>>>>>> executes it only extracts these 20,000.  How can I check where the
>>>>>>> issue is?
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>>
>>>>>>> Jorge Alonso Garcia
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 19, 2019 at 12:52 PM Karl Wright (<daddywri@gmail.com>) wrote:
>>>>>>>
>>>>>>>> By "stop at 20,000" do you mean that it finds more than 20,000
but
>>>>>>>> stops crawling at that time?  Or what exactly do you mean
here?
>>>>>>>>
>>>>>>>> FWIW, the behavior you describe sounds like you may not have
>>>>>>>> installed the SharePoint plugin and may have selected a version
of
>>>>>>>> SharePoint that is inappropriate.  All SharePoint versions
after 2008 limit
>>>>>>>> the number of documents returned using the standard web services
methods.
>>>>>>>> The plugin allows us to bypass that hard limit.
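The bypass Karl refers to is visible in the plugin excerpt earlier in this thread. Distilled into a hedged standalone form (assuming the SharePoint server object model is available), the essential part is:

>>>>>>
using Microsoft.SharePoint;

class ThrottleBypassSketch
{
    // Hedged sketch: build a query that asks SharePoint to ignore the
    // list view threshold (only honored for accounts allowed to do so).
    static SPQuery BuildUnthrottledQuery()
    {
        SPQuery listQuery = new SPQuery();
        listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
        listQuery.ViewAttributes = "Scope=\"Recursive\"";
        listQuery.RowLimit = 1000; // page size; the server cursor (ListItemCollectionPosition) drives paging
        return listQuery;
    }
}
<<<<<<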
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <
>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> We have an issue with the SharePoint connector.
>>>>>>>>> There is a job that crawls a SharePoint 2016 instance, but it is
>>>>>>>>> not retrieving all files; it stops at 20,000 documents without
>>>>>>>>> any error.
>>>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>>>> limitation?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>
>>>>>>>>>
