manifoldcf-user mailing list archives

From: Jorge Alonso Garcia <jalon...@gmail.com>
Subject: Re: sharepoint crawler documents limit
Date: Fri, 20 Dec 2019 11:51:05 GMT
And what other SharePoint parameters could I check?

Jorge Alonso Garcia



On Fri, Dec 20, 2019 at 12:47 PM Karl Wright <daddywri@gmail.com>
wrote:

> The code seems correct and many people are using it without encountering
> this problem.  There may be another SharePoint configuration parameter you
> also need to look at somewhere.
>
> Karl
>
>
> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <jalongar@gmail.com>
> wrote:
>
>>
>> Hi Karl,
>> On SharePoint the list view threshold is 150,000, but we only receive
>> 20,000 from MCF.
>> [image: image.png]
>>
>>
>> Jorge Alonso Garcia
>>
>>
>>
>> On Thu, Dec 19, 2019 at 7:19 PM Karl Wright <daddywri@gmail.com>
>> wrote:
>>
>>> If the job finished without error it implies that the number of
>>> documents returned from this one library was 10000 when the service is
>>> called the first time (starting at doc 0), 10000 when it's called the
>>> second time (starting at doc 10000), and zero when it is called the third
>>> time (starting at doc 20000).
>>>
>>> The plugin code is unremarkable and actually gets results in chunks of
>>> 1000 under the covers:
>>>
>>> >>>>>>
>>> SPQuery listQuery = new SPQuery();
>>> listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
>>> listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
>>> listQuery.ViewAttributes = "Scope=\"Recursive\"";
>>> listQuery.ViewFields = "<FieldRef Name='FileRef' />";
>>> listQuery.RowLimit = 1000;
>>>
>>> XmlDocument doc = new XmlDocument();
>>> retVal = doc.CreateElement("GetListItems",
>>>     "http://schemas.microsoft.com/sharepoint/soap/directory/");
>>> XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");
>>>
>>> uint counter = 0;
>>> do
>>> {
>>>     if (counter >= startRowParam + rowLimitParam)
>>>         break;
>>>
>>>     SPListItemCollection collListItems = oList.GetItems(listQuery);
>>>
>>>     foreach (SPListItem oListItem in collListItems)
>>>     {
>>>         if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
>>>         {
>>>             XmlNode resultNode = doc.CreateElement("GetListItemsResult");
>>>             XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
>>>             idAttribute.Value = oListItem.Url;
>>>             resultNode.Attributes.Append(idAttribute);
>>>             XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
>>>             //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
>>>             urlAttribute.Value = string.Format("{0}?ID={1}",
>>>                 oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
>>>                 oListItem.ID);
>>>             resultNode.Attributes.Append(urlAttribute);
>>>             getListItemsNode.AppendChild(resultNode);
>>>         }
>>>         counter++;
>>>     }
>>>
>>>     listQuery.ListItemCollectionPosition = collListItems.ListItemCollectionPosition;
>>>
>>> } while (listQuery.ListItemCollectionPosition != null);
>>>
>>> retVal.AppendChild(getListItemsNode);
>>> <<<<<<
>>>
>>> The code is clearly working if you get 20000 results returned, so I
>>> submit that perhaps there's a configured limit in your SharePoint instance
>>> that prevents listing more than 20000.  That's the only way I can explain
>>> this.
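>>>
>>> For reference, here is a minimal sketch, not taken from the plugin or connector
>>> sources, of how the farm-side throttle settings could be inspected with the
>>> SharePoint server object model ("http://server" is a placeholder URL).  Queries
>>> issued with SPQueryThrottleOption.Override, as the plugin does above, are
>>> governed by the separate auditor/administrator threshold, which defaults to
>>> 20,000 on a stock farm:
>>>
>>> >>>>>>
>>> using System;
>>> using Microsoft.SharePoint;
>>> using Microsoft.SharePoint.Administration;
>>>
>>> class ThrottleCheck
>>> {
>>>     static void Main()
>>>     {
>>>         // Placeholder URL; point this at the crawled web application.
>>>         using (SPSite site = new SPSite("http://server"))
>>>         {
>>>             SPWebApplication webApp = site.WebApplication;
>>>             // Threshold applied to ordinary throttled queries (the list view threshold).
>>>             Console.WriteLine("MaxItemsPerThrottledOperation: " +
>>>                 webApp.MaxItemsPerThrottledOperation);
>>>             // Threshold applied when a query runs with SPQueryThrottleOption.Override
>>>             // (the auditors/administrators threshold, 20,000 by default).
>>>             Console.WriteLine("MaxItemsPerThrottledOperationOverride: " +
>>>                 webApp.MaxItemsPerThrottledOperationOverride);
>>>         }
>>>     }
>>> }
>>> <<<<<<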
>>>
>>> Karl
>>>
>>>
>>> On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <jalongar@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> The job finishes OK (several times) but always with these 20,000
>>>> documents; for some reason the loop only executes twice.
>>>>
>>>> Jorge Alonso Garcia
>>>>
>>>>
>>>>
>>>> On Thu, Dec 19, 2019 at 6:14 PM Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> If they are all in one library, then you'd be running this code:
>>>>>
>>>>> >>>>>>
>>>>> int startingIndex = 0;
>>>>> int amtToRequest = 10000;
>>>>> while (true)
>>>>> {
>>>>>   com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>>>>>     itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));
>>>>>
>>>>>   MessageElement[] itemsList = itemsResult.get_any();
>>>>>
>>>>>   if (Logging.connectors.isDebugEnabled()){
>>>>>     Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>>>>>   }
>>>>>
>>>>>   if (itemsList.length != 1)
>>>>>     throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));
>>>>>
>>>>>   MessageElement items = itemsList[0];
>>>>>   if (!items.getElementName().getLocalName().equals("GetListItems"))
>>>>>     throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>>>>>
>>>>>   int resultCount = 0;
>>>>>   Iterator iter = items.getChildElements();
>>>>>   while (iter.hasNext())
>>>>>   {
>>>>>     MessageElement child = (MessageElement)iter.next();
>>>>>     if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>>>>>     {
>>>>>       Iterator resultIter = child.getChildElements();
>>>>>       while (resultIter.hasNext())
>>>>>       {
>>>>>         MessageElement result = (MessageElement)resultIter.next();
>>>>>         if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>>>>>         {
>>>>>           resultCount++;
>>>>>           String relPath = result.getAttribute("FileRef");
>>>>>           String displayURL = result.getAttribute("ListItemURL");
>>>>>           fileStream.addFile( relPath, displayURL );
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   }
>>>>>
>>>>>   if (resultCount < amtToRequest)
>>>>>     break;
>>>>>
>>>>>   startingIndex += resultCount;
>>>>> }
>>>>> <<<<<<
>>>>>
>>>>> What this does is request library content URLs in chunks of 10,000.  It
>>>>> stops when it receives fewer than 10,000 documents from any one request.
>>>>>
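>>>>> As an illustration only (not connector code), here is a self-contained sketch of
>>>>> that stop condition: if the service silently capped results at 20,000, the loop
>>>>> would see two full chunks of 10,000, then an empty chunk, and stop at exactly
>>>>> 20,000 documents.  The cap value and the stand-in service are assumptions made
>>>>> for the example:
>>>>>
>>>>> >>>>>>
>>>>> using System;
>>>>>
>>>>> class PagingSketch
>>>>> {
>>>>>     static void Main()
>>>>>     {
>>>>>         const int serverCap = 20000;     // assumed server-side cap on results
>>>>>         const int amtToRequest = 10000;  // chunk size used by the loop above
>>>>>
>>>>>         // Hypothetical stand-in for the web service: how many items a request
>>>>>         // starting at 'start' would return under that cap.
>>>>>         Func<int, int> itemsReturned =
>>>>>             start => Math.Max(0, Math.Min(amtToRequest, serverCap - start));
>>>>>
>>>>>         int startingIndex = 0;
>>>>>         int total = 0;
>>>>>         while (true)
>>>>>         {
>>>>>             int resultCount = itemsReturned(startingIndex);
>>>>>             total += resultCount;
>>>>>             Console.WriteLine("call at " + startingIndex + " -> " + resultCount + " items");
>>>>>             if (resultCount < amtToRequest)
>>>>>                 break;               // fewer than a full chunk: treated as end of list
>>>>>             startingIndex += resultCount;
>>>>>         }
>>>>>         Console.WriteLine("total documents seen: " + total);  // prints 20000
>>>>>     }
>>>>> }
>>>>> <<<<<<
>>>>>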
>>>>> If the documents were all in one library, then one call to the web service
>>>>> yielded 10,000 documents, the second call yielded 10,000 documents, and there
>>>>> was no third call, for reasons I cannot figure out.  Since 10,000 documents
>>>>> were returned each time, the loop ought to just continue unless there was some
>>>>> kind of error.  Does the job succeed, or does it abort?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If you are using the MCF plugin, and selecting the appropriate version of
>>>>>> SharePoint in the connection configuration, there is no hard limit I'm aware
>>>>>> of for any SharePoint job.  We have lots of other people using SharePoint and
>>>>>> nobody has ever reported this before.
>>>>>>
>>>>>> If your SharePoint connection says "SharePoint 2003" as the SharePoint
>>>>>> version, then sure, that would be expected behavior.  So please check that
>>>>>> first.
>>>>>>
>>>>>> The other question I have is about your description of first getting 10,001
>>>>>> documents and then later 20,002.  That's not how ManifoldCF works.  At the
>>>>>> start of the crawl, seeds are added; this would start out being just the
>>>>>> root, and then other documents would be discovered as the crawl proceeded,
>>>>>> after subsites and libraries are discovered.  So I am still trying to square
>>>>>> that with your description of how this is working for you.
>>>>>>
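>>>>>> To make that discovery model concrete, here is a toy sketch (not ManifoldCF
>>>>>> code): the crawl starts from a single seed, the root, discovers subsites and
>>>>>> libraries as it expands them, and only accumulates documents as they are
>>>>>> found underneath.  The hierarchy below is entirely made up for illustration:
>>>>>>
>>>>>> >>>>>>
>>>>>> using System;
>>>>>> using System.Collections.Generic;
>>>>>>
>>>>>> class CrawlSketch
>>>>>> {
>>>>>>     static void Main()
>>>>>>     {
>>>>>>         var queue = new Queue<string>();
>>>>>>         queue.Enqueue("/");                  // the seed: just the site root
>>>>>>         var documents = new List<string>();
>>>>>>
>>>>>>         while (queue.Count > 0)
>>>>>>         {
>>>>>>             string node = queue.Dequeue();
>>>>>>             foreach (string child in Expand(node))
>>>>>>             {
>>>>>>                 if (child.EndsWith(".docx"))
>>>>>>                     documents.Add(child);    // a document counts only once discovered
>>>>>>                 else
>>>>>>                     queue.Enqueue(child);    // a subsite or library: keep expanding
>>>>>>             }
>>>>>>         }
>>>>>>         Console.WriteLine("documents discovered: " + documents.Count);
>>>>>>     }
>>>>>>
>>>>>>     // Placeholder hierarchy standing in for what a connector would discover
>>>>>>     // through web service calls (subsites, then libraries, then files).
>>>>>>     static IEnumerable<string> Expand(string node)
>>>>>>     {
>>>>>>         switch (node)
>>>>>>         {
>>>>>>             case "/": return new[] { "/subsiteA", "/Shared Documents" };
>>>>>>             case "/Shared Documents": return new[] { "/Shared Documents/a.docx", "/Shared Documents/b.docx" };
>>>>>>             case "/subsiteA": return new[] { "/subsiteA/Library" };
>>>>>>             case "/subsiteA/Library": return new[] { "/subsiteA/Library/c.docx" };
>>>>>>             default: return Array.Empty<string>();
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>> <<<<<<
>>>>>>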
>>>>>> Are all of your documents in one library?  Or two libraries?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <
>>>>>> jalongar@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> The UI shows 20,002 documents (in a first phase it showed 10,001, and
>>>>>>> after some processing time it rose to 20,002).
>>>>>>> It looks like a hard limit; there are more files on SharePoint matching
>>>>>>> the criteria used.
>>>>>>>
>>>>>>>
>>>>>>> Jorge Alonso Garcia
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 19, 2019 at 4:05 PM Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Jorge,
>>>>>>>>
>>>>>>>> When you run the job, do you see more than 20,000 documents as part
>>>>>>>> of it?
>>>>>>>>
>>>>>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>>>>>
>>>>>>>> Unless you are seeing a hard number like that in the UI for that job on
>>>>>>>> the job status page, I doubt very much that the problem is a numerical
>>>>>>>> limitation on the number of documents.  I would suspect that the
>>>>>>>> inclusion criteria, e.g. the MIME type or maximum length, are excluding
>>>>>>>> documents.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <
>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>> We have installed the SharePoint plugin, and can access
>>>>>>>>> http://server/_vti_bin/MCPermissions.asmx properly.
>>>>>>>>>
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>> SharePoint has more than 20,000 documents, but when the job executes it
>>>>>>>>> only extracts these 20,000.  How can I check where the issue is?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2019 at 12:52 PM Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> By "stop at 20,000" do you mean that it finds more
than 20,000
>>>>>>>>>> but stops crawling at that time?  Or what exactly
do you mean here?
>>>>>>>>>>
>>>>>>>>>> FWIW, the behavior you describe sounds like you may not have installed
>>>>>>>>>> the SharePoint plugin, or may have selected an inappropriate SharePoint
>>>>>>>>>> version.  All SharePoint versions after 2008 limit the number of
>>>>>>>>>> documents returned through the standard web service methods.  The
>>>>>>>>>> plugin allows us to bypass that hard limit.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <jalongar@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> We have an issue with the SharePoint connector.
>>>>>>>>>>> There is a job that crawls a SharePoint 2016 site, but it is not
>>>>>>>>>>> retrieving all files; it stops at 20,000 documents without any error.
>>>>>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>>>>>> limitation?
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>
>>>>>>>>>>>
