manifoldcf-user mailing list archives

From Gaurav G <goyalgaur...@gmail.com>
Subject Re: Sharepoint Crawl - Missing documents
Date Wed, 06 Mar 2019 05:38:57 GMT
Hi Karl,

There are no subsites as such. The library is one big flat structure that
holds all the documents, and the same goes for the list.
We enabled debug logging for the connector and ran the list job. Below is
the exception it throws after crawling the list partially. It looks like,
after hitting this exception, the job starts over from the beginning,
repeats that a few times, and then quits.
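
When reading a connector debug log like the one in this message, it can help
to measure how long a worker thread was silent before each remote exception.
The following is a minimal sketch, not part of ManifoldCF. It assumes each log
entry sits on a single line in the real log file (the mail client wrapped the
lines quoted here) and that entries follow the "LEVEL timestamp (thread) -
message" layout seen in this thread. The function names and the default
marker string are hypothetical.

```python
import re
from datetime import datetime

# Prefix of entries such as:
#   DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: ...
LINE_RE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\((?P<thread>[^)]*)\)\s+-\s+(?P<msg>.*)$"
)


def parse_entries(lines):
    """Yield (timestamp, thread, message) for each line that starts an entry.

    Lines that do not match (for example stack trace frames) are skipped.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
            yield ts, m.group("thread"), m.group("msg")


def gaps_before_errors(lines, marker="remote exception"):
    """Return (thread, seconds) pairs: the time between each entry whose
    message contains `marker` and the previous entry from the same thread."""
    last_seen = {}
    gaps = []
    for ts, thread, msg in parse_entries(lines):
        if marker in msg and thread in last_seen:
            gaps.append((thread, (ts - last_seen[thread]).total_seconds()))
        last_seen[thread] = ts
    return gaps


# Two entries taken from this message, rejoined onto single lines.
sample = [
    "DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: "
    "Checking whether to include list item '/CONTENT/145122_.000'",
    "DEBUG 2019-03-05T23:50:15,599 (Worker thread '6') - SharePoint: "
    "Got an unknown remote exception getting child documents for site  guid "
    "{A6079591-4150-410E-9C12-B5CAEF02D400}",
]
print(gaps_before_errors(sample))
```

For the two timestamps quoted in this message the gap is 117.5 seconds, so the
thread logged nothing for almost two minutes before the fault was reported.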

DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: Checking whether to include list item '/CONTENT/145120_.000'
DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: Checking whether to include list item '/CONTENT/145121_.000'
DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: Checking whether to include list item '/CONTENT/145122_.000'
DEBUG 2019-03-05T23:50:15,599 (Worker thread '6') - SharePoint: Got an unknown remote exception getting child documents for site  guid {A6079591-4150-410E-9C12-B5CAEF02D400} - axis fault = Server.userException, detail = org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages - retrying
org.apache.axis.AxisFault: ; nested exception is:
        org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages
        at org.apache.axis.AxisFault.makeFault(AxisFault.java:101) ~[axis-1.4.jar:?]
        at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:701) ~[axis-1.4.jar:?]
        at org.apache.axis.Message.getSOAPEnvelope(Message.java:435) ~[axis-1.4.jar:?]
        at org.apache.axis.handlers.soap.MustUnderstandChecker.invoke(MustUnderstandChecker.java:62) ~[axis-1.4.jar:?]
        at org.apache.axis.client.AxisClient.invoke(AxisClient.java:206) ~[axis-1.4.jar:?]
        at org.apache.axis.client.Call.invokeEngine(Call.java:2784) ~[axis-1.4.jar:?]
        at org.apache.axis.client.Call.invoke(Call.java:2767) ~[axis-1.4.jar:?]
        at org.apache.axis.client.Call.invoke(Call.java:2443) ~[axis-1.4.jar:?]
        at org.apache.axis.client.Call.invoke(Call.java:2366) ~[axis-1.4.jar:?]
        at org.apache.axis.client.Call.invoke(Call.java:1812) ~[axis-1.4.jar:?]
        at com.microsoft.sharepoint.webpartpages.PermissionsSoapStub.getListItems(PermissionsSoapStub.java:234) ~[mcf-sharepoint-connector.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getChildren(SPSProxyHelper.java:661) [mcf-sharepoint-connector.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:898) [mcf-sharepoint-connector.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
Caused by: org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages
        at org.apache.axis.encoding.DeserializationContext.startDTD(DeserializationContext.java:1161) ~[?:?]
        at org.apache.xerces.parsers.AbstractSAXParser.doctypeDecl(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.impl.dtd.XMLDTDValidator.doctypeDecl(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
        at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227) ~[?:?]
        at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696) ~[?:?]
        ... 12 more
 WARN 2019-03-05T23:50:15,602 (Worker thread '6') - Service interruption reported for job 1551357423253 connection 'Finance Test List': Remote procedure exception: ; nested exception is:
        org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages
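
In the trace above, the SAXException is raised from
DeserializationContext.startDTD, reached through the Xerces doctypeDecl
callbacks, which suggests the parser hit a DOCTYPE declaration in the response
body. One possible reading (an assumption, not something the log confirms) is
that the server returned a non-SOAP document, such as an HTML page, in place of
the expected envelope. If the raw response can be captured by some means, a
rough check like the sketch below could tell the two cases apart. The function
name and the "doctype" / "soap" / "unknown" labels are hypothetical, and the
check is a heuristic, not an XML parser.

```python
import re

# A DOCTYPE declaration near the top of the body is what a SOAP parser rejects.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\b", re.IGNORECASE)
# An element named Envelope, with or without a namespace prefix.
ENVELOPE_RE = re.compile(r"<(?:[A-Za-z_][\w.-]*:)?Envelope\b")


def classify_response(body: str) -> str:
    """Roughly classify a captured HTTP response body.

    Returns "doctype" if a DOCTYPE declaration appears in the first 4096
    characters, "soap" if an Envelope element appears there instead,
    and "unknown" otherwise.
    """
    # Drop a leading byte-order mark and whitespace, then look at the start only.
    head = body.lstrip("\ufeff \t\r\n")[:4096]
    if DOCTYPE_RE.search(head):
        return "doctype"
    if ENVELOPE_RE.search(head):
        return "soap"
    return "unknown"


html_body = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"><html></html>'
soap_body = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body/></soap:Envelope>"
)
print(classify_response(html_body))
print(classify_response(soap_body))
```

A "doctype" result for the failing request would be consistent with the
startDTD frame in the trace, but it would not by itself explain why the server
sent that body.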


Thanks,
Gaurav

On Mon, Mar 4, 2019 at 5:11 PM Karl Wright <daddywri@gmail.com> wrote:

> Hi Gaurav,
> There is no document count threshold value.
> If you can identify libraries or subsites that aren't being crawled, you
> can turn on connector debugging to see why the connector is skipping them.
> There could be many reasons for a library or site to be skipped, e.g. bad
> specification rules, or permissions insufficient to read them.
>
> Karl
>
>
> On Mon, Mar 4, 2019 at 4:03 AM Gaurav G <goyalgauravg@gmail.com> wrote:
>
>> Hi,
>>
>> We are trying to crawl a SharePoint list with about 150,000 items and a
>> library with about 125,000 documents.
>> We have a separate job for each. The list job crawls only about 50,000
>> items and completes cleanly, while the library job crawls about 40,000
>> documents and completes cleanly.
>> We are trying to figure out why we are not getting the complete list. Is
>> there a threshold value beyond which crawling stops?
>> For smaller repos (<30,000 items) we are not facing any issue. Those get
>> crawled completely.
>>
>> Thanks,
>> Gaurav
>>
>>
