manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Sharepoint Crawl - Missing documents
Date Wed, 06 Mar 2019 16:09:24 GMT
The SharePoint connector requests documents in chunks of size 10,000.  The
request you point at gets the documents from row 50,000 through 60,000.
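
As a rough illustration of that paging (a sketch only, not the connector's
actual code; the page_windows helper is hypothetical, and the 150,000 total
is the approximate list size from the original post), the startRow/rowLimit
pairs would look like this:

```python
def page_windows(total_items, row_limit=10_000):
    """Yield (start_row, row_limit) pairs that cover total_items rows.

    Hypothetical helper for illustration; the real connector's paging
    logic may differ in detail.
    """
    for start_row in range(0, total_items, row_limit):
        yield start_row, row_limit


# Assumed total: roughly 150,000 list items, as reported in the thread.
windows = list(page_windows(150_000))

# The request quoted in the log (startRow=50000, rowLimit=10000) is the
# sixth window, i.e. index 5 when counting from zero.
print(windows[5])
```

Under these assumptions the failing request is the sixth chunk, which lines
up with the list job stopping at about 50,000 items.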

The error text (if that is related to this request) shows that the request
is timing out because SharePoint is not responding in a timely manner.  I
wonder if there's a problem with memory allocation on that machine?
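
For reference, the error text can be read straight out of the redirect's
Location header. A minimal sketch using only the Python standard library
(the error_text function name is made up for this example; the URL is the
one from the quoted log):

```python
from urllib.parse import urlsplit, parse_qs

# Location header value copied from the quoted ManifoldCF debug log.
location = (
    "http://finance.mysite.in:9070/sites/Finance/_layouts/15/"
    "error.aspx?ErrorText=Request%20timed%20out%2E"
)


def error_text(location_header):
    """Return the decoded ErrorText query parameter, or None if absent."""
    query = urlsplit(location_header).query
    values = parse_qs(query).get("ErrorText", [])
    return values[0] if values else None


print(error_text(location))  # prints: Request timed out.
```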

Karl


On Wed, Mar 6, 2019 at 11:01 AM Gaurav G <goyalgauravg@gmail.com> wrote:

> Hi Karl,
>
> On further digging in the ManifoldCF log, I found the following lines. Do
> they point to any possible reason?
> We are working on getting the web-service-specific logs enabled in
> SharePoint. I also wanted to check whether the ManifoldCF SharePoint plugin
> prints any logs.
>
> DEBUG 2019-03-06T17:25:06,833 (Thread-6086051) - http-outgoing-150056
> >> "<?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope
> xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
> xmlns:xsd="http://www.w3.org/2001/XMLSchema"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><soapenv:Body><GetListItems
> xmlns="http://microsoft.com/sharepoint/webpartpages/"><listName>{A6079591-4150-410E-9C12-B5CAEF02D400}</listName>
> <startRow>50000</startRow><rowLimit>10000</rowLimit>
> </GetListItems></soapenv:Body></soapenv:Envelope>"
> .....
> ......
> DEBUG 2019-03-06T17:26:59,942 (Thread-6086051) - http-outgoing-150056 <<
> "HTTP/1.1 302 Found[\r][\n]"
> DEBUG 2019-03-06T17:26:59,942 (Thread-6086051) - http-outgoing-150056 <<
> "Cache-Control: private, max-age=0[\r][\n]"
> DEBUG 2019-03-06T17:26:59,942 (Thread-6086051) - http-outgoing-150056 <<
> "Transfer-Encoding: chunked[\r][\n]"
> DEBUG 2019-03-06T17:26:59,942 (Thread-6086051) - http-outgoing-150056 <<
> "Location:
> http://finance.mysite.in:9070/sites/Finance/_layouts/15/error.aspx?ErrorText=Request%20timed%20out%2E[\r][\n]"
>
> Thanks,
>
> Gaurav
>
> On Wed, Mar 6, 2019 at 4:44 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Gaurav,
>>
>> Then I don't understand what is wrong.  I've never seen this before, and
>> that was the only thing I could think of.  The only thing I can add is that
>> the problem is taking place on the SharePoint side, so maybe (as the error
>> suggests) it might be worth looking at the SharePoint server logs.
>>
>> Karl
>>
>>
>>
>> On Wed, Mar 6, 2019 at 5:42 AM Gaurav G <goyalgauravg@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> The SharePoint version is 2013. I double-checked: the version of the
>>> plugin installed on the server and the version selected in the connection
>>> configuration are both 2013.
>>>
>>> Thanks,
>>> Gaurav
>>>
>>> On Wed, Mar 6, 2019 at 12:33 PM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Gaurav,
>>>> Which version of SharePoint is this?  And, did you install the
>>>> SharePoint plugin for ManifoldCF, and select the correct versions of
>>>> SharePoint in the connection configuration?
>>>>
>>>> Versions of SharePoint after 2010 limited the number of documents that
>>>> could be returned from the Lists service.  The MCF plugin for SharePoint
>>>> not only includes the ability to obtain user permissions, but also provides
>>>> our own implementation of Lists that is not so limited.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Wed, Mar 6, 2019 at 12:39 AM Gaurav G <goyalgauravg@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> There are no subsites as such. It is one big library with all
>>>>> documents in it in a flat structure. The same goes for the list.
>>>>> We enabled the logging for the connector and ran the list job. Below
>>>>> is the exception that it throws after it has crawled the list partially. It
>>>>> looks like after it gets this exception it tries to start over from the
>>>>> beginning and tries to do that a few times and then quits.
>>>>>
>>>>> DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint:
>>>>> Checking whether to include list item '/CONTENT/145120_.000'
>>>>> DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint:
>>>>> Checking whether to include list item '/CONTENT/145121_.000'
>>>>> DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint:
>>>>> Checking whether to include list item '/CONTENT/145122_.000'
>>>>> DEBUG 2019-03-05T23:50:15,599 (Worker thread '6') - SharePoint: Got an
>>>>> unknown remote exception getting child documents for site  guid
>>>>> {A6079591-4150-410E-9C12-B5CAEF02D400} - axis fault = Server.userException,
>>>>> detail = org.xml.sax.SAXException: Processing instructions are not allowed
>>>>> within SOAP messages - retrying
>>>>> org.apache.axis.AxisFault: ; nested exception is:
>>>>>         org.xml.sax.SAXException: Processing instructions are not
>>>>> allowed within SOAP messages
>>>>>         at org.apache.axis.AxisFault.makeFault(AxisFault.java:101)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at
>>>>> org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:701)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at org.apache.axis.Message.getSOAPEnvelope(Message.java:435)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at
>>>>> org.apache.axis.handlers.soap.MustUnderstandChecker.invoke(MustUnderstandChecker.java:62)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at
>>>>> org.apache.axis.client.AxisClient.invoke(AxisClient.java:206)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>> ~[axis-1.4.jar:?]
>>>>>         at
>>>>> com.microsoft.sharepoint.webpartpages.PermissionsSoapStub.getListItems(PermissionsSoapStub.java:234)
>>>>> ~[mcf-sharepoint-connector.jar:?]
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getChildren(SPSProxyHelper.java:661)
>>>>> [mcf-sharepoint-connector.jar:?]
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:898)
>>>>> [mcf-sharepoint-connector.jar:?]
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>> [mcf-pull-agent.jar:?]
>>>>> Caused by: org.xml.sax.SAXException: Processing instructions are not
>>>>> allowed within SOAP messages
>>>>>         at
>>>>> org.apache.axis.encoding.DeserializationContext.startDTD(DeserializationContext.java:1161)
>>>>> ~[?:?]
>>>>>         at
>>>>> org.apache.xerces.parsers.AbstractSAXParser.doctypeDecl(Unknown Source)
>>>>> ~[xercesImpl-2.10.0.jar:?]
>>>>>         at
>>>>> org.apache.xerces.impl.dtd.XMLDTDValidator.doctypeDecl(Unknown Source)
>>>>> ~[xercesImpl-2.10.0.jar:?]
>>>>>         at
>>>>> org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown
>>>>> Source) ~[xercesImpl-2.10.0.jar:?]
>>>>>         at
>>>>> org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown
>>>>> Source) ~[xercesImpl-2.10.0.jar:?]
>>>>>         at
>>>>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
>>>>> Source) ~[xercesImpl-2.10.0.jar:?]
>>>>>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown
>>>>> Source) ~[xercesImpl-2.10.0.jar:?]
>>>>>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown
>>>>> Source) ~[xercesImpl-2.10.0.jar:?]
>>>>>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>>>> ~[xercesImpl-2.10.0.jar:?]
>>>>>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown
>>>>> Source) ~[xercesImpl-2.10.0.jar:?]
>>>>>         at
>>>>> org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>>>>> ~[xercesImpl-2.10.0.jar:?]
>>>>>         at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>>>>> ~[xercesImpl-2.10.0.jar:?]
>>>>>         at
>>>>> org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
>>>>> ~[?:?]
>>>>>         at
>>>>> org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696) ~[?:?]
>>>>>         ... 12 more
>>>>>  WARN 2019-03-05T23:50:15,602 (Worker thread '6') - Service
>>>>> interruption reported for job 1551357423253 connection 'Finance Test List':
>>>>> Remote procedure exception: ; nested exception is:
>>>>>         org.xml.sax.SAXException: Processing instructions are not
>>>>> allowed within SOAP messages
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Gaurav
>>>>>
>>>>> On Mon, Mar 4, 2019 at 5:11 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Gaurav,
>>>>>> There is no document count threshold value.
>>>>>> If you can identify libraries or subsites that aren't being crawled,
>>>>>> you can turn on connector debugging to see why the connector is skipping
>>>>>> them.  There could be many reasons for a library or site to be skipped,
>>>>>> e.g. bad specification rules, or permissions insufficient to read them.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 4, 2019 at 4:03 AM Gaurav G <goyalgauravg@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are trying to crawl a SharePoint list with about 150,000 items
>>>>>>> and a library with about 125,000 documents.
>>>>>>> We have separate jobs for both. The list job only crawls about 50,000
>>>>>>> items and completes cleanly, while the library job crawls about 40,000
>>>>>>> documents and completes cleanly.
>>>>>>> We are trying to figure out why we are not getting the complete
>>>>>>> list. Is there a threshold value beyond which the crawling doesn't happen?
>>>>>>> For smaller repos (<30,000 items) we are not facing any issue. Those
>>>>>>> get crawled completely.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Gaurav
>>>>>>>
>>>>>>>
