manifoldcf-user mailing list archives

From Jorge Alonso Garcia <jalon...@gmail.com>
Subject Re: sharepoint crawler documents limit
Date Thu, 19 Dec 2019 17:51:15 GMT
Hi,
The job finishes OK (several times), but always with these 20,000 documents;
for some reason the loop only executes twice.

Jorge Alonso Garcia



On Thu, Dec 19, 2019 at 18:14, Karl Wright (<daddywri@gmail.com>)
wrote:

> If they are all in one library, then you'd be running this code:
>
> >>>>>>
>         int startingIndex = 0;
>         int amtToRequest = 10000;
>         while (true)
>         {
>
>           com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>             itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));
>
>           MessageElement[] itemsList = itemsResult.get_any();
>
>           if (Logging.connectors.isDebugEnabled()){
>             Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>           }
>
>           if (itemsList.length != 1)
>             throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));
>
>           MessageElement items = itemsList[0];
>           if (!items.getElementName().getLocalName().equals("GetListItems"))
>             throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>
>           int resultCount = 0;
>           Iterator iter = items.getChildElements();
>           while (iter.hasNext())
>           {
>             MessageElement child = (MessageElement)iter.next();
>             if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>             {
>               Iterator resultIter = child.getChildElements();
>               while (resultIter.hasNext())
>               {
>                 MessageElement result = (MessageElement)resultIter.next();
>                 if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>                 {
>                   resultCount++;
>                   String relPath = result.getAttribute("FileRef");
>                   String displayURL = result.getAttribute("ListItemURL");
>                   fileStream.addFile( relPath, displayURL );
>                 }
>               }
>
>             }
>           }
>
>           if (resultCount < amtToRequest)
>             break;
>
>           startingIndex += resultCount;
>         }
> <<<<<<
>
> What this does is request library content URLs in chunks of 10000.  It
> stops when it receives fewer than 10000 documents from any one request.
>
> If the documents were all in one library, then one call to the web service
> yielded 10000 documents, and the second call yielded 10000 documents, and
> there was no third call, for no reason I can figure out.  Since 10000
> documents were returned each time the loop ought to just continue, unless
> there was some kind of error.  Does the job succeed, or does it abort?
>
> Karl
>
>
> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> If you are using the MCF plugin, and selecting the appropriate version of
>> Sharepoint in the connection configuration, there is no hard limit I'm
>> aware of for any Sharepoint job.  We have lots of other people using
>> SharePoint and nobody has reported this ever before.
>>
>> If your SharePoint connection says "SharePoint 2003" as the SharePoint
>> version, then sure, that would be expected behavior.  So please check that
>> first.
>>
>> The other question I have is your description of you first getting 10001
>> documents and then later 20002.  That's not how ManifoldCF works.  At the
>> start of the crawl, seeds are added; this would start out just being the
>> root, and then other documents would be discovered as the crawl proceeded,
>> after subsites and libraries are discovered.  So I am still trying to
>> square that with your description of how this is working for you.
>>
>> Are all of your documents in one library?  Or two libraries?
>>
>> Karl
>>
>>
>>
>>
>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <jalongar@gmail.com>
>> wrote:
>>
>>> Hi,
>>> The UI shows 20,002 documents (in a first phase it shows 10,001, and
>>> after some processing time it rises to 20,002).
>>> It looks like a hard limit; there are more files on SharePoint matching
>>> the criteria used.
>>>
>>>
>>> Jorge Alonso Garcia
>>>
>>>
>>>
>>> On Thu, Dec 19, 2019 at 16:05, Karl Wright (<daddywri@gmail.com>)
>>> wrote:
>>>
>>>> Hi Jorge,
>>>>
>>>> When you run the job, do you see more than 20,000 documents as part of
>>>> it?
>>>>
>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>
>>>> Unless you are seeing a hard number like that in the UI for that job on
>>>> the job status page, I doubt very much that the problem is a numerical
>>>> limitation in the number of documents.  I would suspect that the inclusion
>>>> criteria, e.g. the mime type or maximum length, is excluding documents.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <jalongar@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>> We have installed the SharePoint plugin, and we can access
>>>>> http:/server/_vti_bin/MCPermissions.asmx properly.
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> SharePoint has more than 20,000 documents, but when we execute the job
>>>>> it only extracts these 20,000. How can I check where the issue is?
>>>>>
>>>>> Regards
>>>>>
>>>>>
>>>>> Jorge Alonso Garcia
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 19, 2019 at 12:52, Karl Wright (<daddywri@gmail.com>)
>>>>> wrote:
>>>>>
>>>>>> By "stop at 20,000" do you mean that it finds more than 20,000 but
>>>>>> stops crawling at that time?  Or what exactly do you mean here?
>>>>>>
>>>>>> FWIW, the behavior you describe sounds like you may not have
>>>>>> installed the SharePoint plugin and may have selected a version of
>>>>>> SharePoint that is inappropriate.  All SharePoint versions after
>>>>>> 2008 limit the number of documents returned using the standard web
>>>>>> services methods.  The plugin allows us to bypass that hard limit.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <
>>>>>> jalongar@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> We have an issue with the SharePoint connector.
>>>>>>> There is a job that crawls SharePoint 2016, but it is not
>>>>>>> retrieving all files; it stops at 20,000 documents without any error.
>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>> limitation?
>>>>>>>
>>>>>>> Regards
>>>>>>> Jorge Alonso Garcia
>>>>>>>
>>>>>>>
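
Editorial sketch: the termination rule of the paging loop quoted in this
thread (request chunks of amtToRequest, stop on the first short chunk) can
be modeled without SharePoint at all. The class PagingSketch, the method
fetchPage and the totalItems parameter below are hypothetical stand-ins,
not part of ManifoldCF or the SharePoint web service; fetchPage simply
assumes a server that returns whatever is left, capped at the requested
amount. It only illustrates how many requests the loop makes for a given
number of available items, and is not a diagnosis of the problem reported
here.

```java
public class PagingSketch {
    /** Items collected and number of requests made while draining a paged source. */
    public static final class Outcome {
        public final int items;
        public final int calls;
        Outcome(int items, int calls) { this.items = items; this.calls = calls; }
    }

    // Hypothetical stand-in for the web service call: returns how many items
    // come back for (startingIndex, amtToRequest) when totalItems exist.
    static int fetchPage(int totalItems, int startingIndex, int amtToRequest) {
        int remaining = Math.max(0, totalItems - startingIndex);
        return Math.min(remaining, amtToRequest);
    }

    // Same stopping rule as the quoted loop: keep requesting until a chunk
    // comes back with fewer than amtToRequest results. Unlike the quoted
    // code, the index is advanced before the check so it doubles as a total.
    public static Outcome drain(int totalItems, int amtToRequest) {
        int startingIndex = 0;
        int calls = 0;
        while (true) {
            int resultCount = fetchPage(totalItems, startingIndex, amtToRequest);
            calls++;
            startingIndex += resultCount;
            if (resultCount < amtToRequest)
                break;
        }
        return new Outcome(startingIndex, calls);
    }

    public static void main(String[] args) {
        for (int total : new int[] {5000, 20000, 25000}) {
            Outcome o = drain(total, 10000);
            System.out.println(total + " available -> " + o.items
                + " fetched in " + o.calls + " calls");
        }
    }
}
```

Under these assumptions, a source with exactly 20,000 items needs three
requests: two full chunks and a final empty one that ends the loop. A
source with 25,000 items also needs three, the last returning 5,000.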
