manifoldcf-user mailing list archives

From Phil Riethmuller <priethmul...@funnelback.com>
Subject Re: HTTP 302 error causing job to abort
Date Mon, 22 Feb 2016 22:09:42 GMT
Thanks Karl,

I'll take a look at this today.

Regards,

Phil Riethmuller
Technical Consultant
 
Funnelback | 437 Kent Street, Sydney, NSW 2000
T +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>

AUSTRALIA | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES


Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>  -
Twitter


From:  Karl Wright <daddywri@gmail.com>
Reply-To:  <user@manifoldcf.apache.org>
Date:  Monday, 22 February 2016 11:32 pm
To:  "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject:  Re: HTTP 302 error causing job to abort

Any news on this research?
Karl


On Fri, Feb 19, 2016 at 12:46 AM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Phil,
> 
> Thanks -- this information is more helpful.
> 
> So my understanding is that there is an external site reference in your
> site/subsite hierarchy?  And the *root* site (the one that you point at when
> you configure the connection itself) is *not* external after all?
> 
> If that is the case, then the external site must be being "discovered" through
> the Webs service API call.  There are two ways forward:
> 
> (1)  We can change the Webs response parsing to detect external sites and not
> include those in the crawl, or
> (2) We can try to make decisions based on whether a 302 comes back as a
> response code.
> 
> (1) is by far the best approach, but it will require some cooperation and
> execution of sample code on your part.  Essentially I'll need to see the
> xml that comes back that first describes the external site, and see if
> there is an attribute that lets us know it is external.  That way I can
> properly skip it entirely.
> 
> We can have a look at what comes back from SharePoint for this API response if
> you enable connector debugging in properties.xml:
> 
> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
> 
> ... and restart.  You will then need to do a crawl.  The following line will
> be what you look for:
> 
> Logging.connectors.debug("SharePoint: getSites xml response: "+xmlResponse);
> 
> This xml response will contain "Url" and "Title" nodes; what I need to know is
> whether there's any attribute of the "Url" node, or any parallel node other
> than "Url" or "Title", that indicates whether the Url that describes the
> external site is indeed external.  So look for the Url that describes the
> SharePoint URL that has the redirection, and tell me if there's anything
> special about it in the associated getSites response.  Does that make sense?
> 
> If this is too hard, alternative (2) is possible, but it will require tons of
> individual changes.  So let's look into (1) first.
> 
> Thanks
> Karl
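
A minimal sketch of the inspection Karl describes above, in Python, for anyone reading the connector log by hand. The log marker `SharePoint: getSites xml response: ` is taken from the debug line quoted above; everything else (the helper names, the sample XML, and in particular the `Kind` attribute) is hypothetical and only illustrates what "something special" about a Url entry might look like. The sketch assumes the whole XML response sits on a single log line, and it accepts Url/Title either as attributes or as child nodes, since the thread does not say which form the response takes.

```python
import xml.etree.ElementTree as ET

# Marker text copied from the connector's debug statement quoted in the thread.
MARKER = "SharePoint: getSites xml response: "


def local(name):
    """Strip an ElementTree '{namespace}' prefix from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def extract_xml(log_line):
    """Return the XML text that follows the marker, or None if it is absent."""
    if MARKER not in log_line:
        return None
    return log_line.split(MARKER, 1)[1].strip()


def describe_url_entries(xml_text):
    """List every Url found, with whatever else accompanies it.

    'extras' holds attributes or sibling nodes other than Url and Title,
    which is where an "is external" hint would show up if one exists.
    """
    root = ET.fromstring(xml_text)
    parents = {child: parent for parent in root.iter() for child in parent}
    entries = []
    for elem in root.iter():
        attrs = {local(k): v for k, v in elem.attrib.items()}
        if "Url" in attrs:
            # Form 1: Url and Title are attributes of the same element.
            extras = {k: v for k, v in attrs.items() if k not in ("Url", "Title")}
            entries.append({"url": attrs["Url"], "extras": extras})
        elif local(elem.tag) == "Url":
            # Form 2: Url is its own node; collect its attributes and siblings.
            extras = dict(attrs)
            parent = parents.get(elem)
            if parent is not None:
                for sibling in parent:
                    name = local(sibling.tag)
                    if name not in ("Url", "Title"):
                        extras[name] = (sibling.text or "").strip()
            entries.append({"url": (elem.text or "").strip(), "extras": extras})
    return entries


if __name__ == "__main__":
    # Hypothetical log line; the real response may look quite different.
    line = ("DEBUG ... - " + MARKER +
            '<Webs><Web Title="Team" Url="http://sp.example.invalid/team"/>'
            '<Web Title="Partner" Url="http://sp.example.invalid/partner" '
            'Kind="external"/></Webs>')
    for entry in describe_url_entries(extract_xml(line)):
        print(entry["url"], entry["extras"])
```

An entry whose `extras` is non-empty only for the redirecting site would be the kind of difference worth reporting back to the list; if every entry has empty `extras`, the response carries no such hint in this form.
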
> 
> 
> On Thu, Feb 18, 2016 at 11:49 PM, Phil Riethmuller
> <priethmuller@funnelback.com> wrote:
>> Hi Karl,
>> 
>> Some further info:
>> * The problem document that Manifold reported is redirecting to an external
>> site.
>> * We tried crawling a smaller subset of content on the same SharePoint site
>> that definitely doesn't contain any external links in the content, and this
>> works OK.
>> * The job that errors with the 302 says it has found 529 docs so far and
>> processed 127 of them. This seems to indicate that it has in fact found some
>> documents.
>> I'm not sure what you mean when you say the error is being generated from the
>> API call and not from an individual document. The info appears to indicate it
>> is not all documents, but just selected documents.
>> 
>> There really isn't much we can do about this from the SharePoint
>> configuration side. Is there any way we can test whether it is as simple as
>> the 302 coming from the documents themselves?
>> 
>> Thanks for your help to date.
>> 
>> Phil
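
One way to check by hand whether a given 302 points off-site is to compare the host in the response's `Location` header with the host of the URL that was requested. The sketch below is Python; the function name and the example URLs are made up for illustration. It only classifies a response that has already been captured (for example from http wire logging) and does not fetch anything itself.

```python
from urllib.parse import urljoin, urlsplit


def is_external_redirect(request_url, status, location):
    """Return True when a 3xx response redirects to a different host.

    request_url: the URL that was requested
    status:      the numeric HTTP status of the response
    location:    the value of the Location header, or None if absent
    """
    if not (300 <= status < 400) or not location:
        return False
    # A relative Location is resolved against the requested URL first.
    target = urljoin(request_url, location)
    # .hostname is lower-cased, so the comparison ignores case.
    return urlsplit(target).hostname != urlsplit(request_url).hostname


if __name__ == "__main__":
    # Hypothetical values standing in for what the wire log would show.
    print(is_external_redirect(
        "http://sp.example.invalid/sites/a", 302, "http://partner.example.invalid/"))
```

Comparing hosts is a deliberately crude definition of "external"; a redirect to another port or another web application on the same host would count as internal here.
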
>> 
>> 
>> From:  Karl Wright <daddywri@gmail.com>
>> Reply-To:  <user@manifoldcf.apache.org>
>> Date:  Thursday, 18 February 2016 10:31 am
>> 
>> To:  "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> Subject:  Re: HTTP 302 error causing job to abort
>> 
>> Hi Phil,
>> 
>> The 302 error is not coming from a single document.  If it *was* coming from
>> the fetch of an individual document, it would be easy to work around.  But,
>> from your stack trace, it is clear that this error is coming from an API
>> call, specifically a call to enumerate subsites of a given site.  That means
>> that some or all of the SharePoint hierarchy is not accessible through POST
>> requests.  I have never seen this kind of behavior from SharePoint before.
>> 
>> This is not something that I can work around without more information.  In
>> order to get that information, you will at the very minimum need to turn on
>> connector debugging, and probably turning on http wire debugging would be
>> helpful too.  And, if what you said about the View page for this connection
>> is true and it also shows a 302 error, I very much suspect that something
>> changed on the server end and you are currently unable to crawl *any*
>> documents at all.
>> 
>> I am sorry I cannot make this any clearer.
>> 
>> Thanks,
>> Karl
>> 
>> 
>> 
>> 
>> On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller
>> <priethmuller@funnelback.com> wrote:
>>> Hi Karl,
>>> 
>>> Thanks for the update.
>>> 
>>> I'm not 100% sure how many documents have this redirect in them, but I'll
>>> see if I can get a better estimate. The content we are crawling is
>>> substantial and comes from many different authors, so it's difficult to
>>> manage how these SharePoint documents are created. That makes it extremely
>>> difficult to pinpoint all the documents that contain redirects.
>>> 
>>> Am I correct in assuming a single 302 error causes the job to fail, or is
>>> there some other logic that determines this?
>>> 
>>> How plausible would it be to include in the product an option for treating
>>> 302s as a warning rather than a fatal error? Possibly just an option in
>>> the Job setup?
>>> 
>>> Regards,
>>> Phil
>>> 
>>> 
>>> From:  Karl Wright <daddywri@gmail.com>
>>> Reply-To:  <user@manifoldcf.apache.org>
>>> Date:  Thursday, 18 February 2016 1:39 am
>>> 
>>> To:  "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>> Subject:  Re: HTTP 302 error causing job to abort
>>> 
>>> Hi again Phil,
>>> 
>>> The HttpClient team points out that POST requests (as we do for the
>>> SharePoint repository requests) are not allowed to follow 302 redirections
>>> according to RFC2616.  We use POST requests because, for SOAP, there is
>>> often quite a bit of XML data that goes along with the request, and we would
>>> otherwise have size issues.  So we cannot use GET instead of POST.  See
>>> CONNECTORS-1279 for details.
>>> 
>>> If you still believe that it is only a couple of URLs that are returning 302
>>> for you, I'd like some analysis of why you believe that to be true.  I would
>>> be happy to consider recognition of an occasional 302 response as meaning
>>> "skip this document".  On the other hand, based on your stack trace, it
>>> really appears that you have a far more systemic problem; it is failing
>>> while obtaining information for an entire site, so not much would get
>>> crawled in that case.
>>> 
>>> Thanks,
>>> Karl
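
The rule Karl refers to can be written down as a small decision function. This is a Python sketch of the redirect rules in RFC 2616 section 10.3 as summarized here, not HttpClient's actual implementation; the function name is made up. Under that RFC, a 301, 302, or 307 received for anything other than GET or HEAD must not be followed automatically, while a 303 tells the client to retrieve the new URI with GET.

```python
def may_auto_redirect(method, status):
    """Say whether RFC 2616 lets a client follow this redirect unprompted."""
    method = method.upper()
    if status == 303:
        # "See Other": the follow-up request is a GET, whatever the original was.
        return True
    if status in (301, 302, 307):
        # Only safe to follow automatically for GET and HEAD requests.
        return method in ("GET", "HEAD")
    # Not a redirect status that names a new location to follow.
    return False


if __name__ == "__main__":
    # The SOAP calls in this thread are POSTs answered with a 302.
    print(may_auto_redirect("POST", 302))
```

For the case in this thread, a POST answered with 302, the function returns False, which matches the behavior described above: the client surfaces the 302 instead of following it.
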
>>> 
>>> 
>>> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>> Hi Phil,
>>>> 
>>>> It is not surprising that the connector doesn't like 302 responses and
>>>> doesn't know what to do with them, because it isn't supposed to ever be
>>>> getting any of these.
>>>> 
>>>> I am puzzled by your statement that "only a couple of documents have
>>>> redirections in them", because the connector crawls Lists and Library
>>>> documents within SharePoint *only*, and these are very specifically
>>>> accessible through a SharePoint URL hierarchy structure.  There's no room
>>>> in any of that for a 302 redirection.  Since you see a 302 in the UI, I
>>>> feel pretty certain you have a problem with your configuration and it is
>>>> not just "a couple of documents".
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller
>>>> <priethmuller@funnelback.com> wrote:
>>>>> Thanks Karl,
>>>>> 
>>>>> The majority of content is not going to the redirect; it's probably just
>>>>> a handful of documents that are behaving this way.
>>>>> 
>>>>> I'd agree that it's of lesser concern whether or not the document itself
>>>>> is indexed. However, I wouldn't expect the 302 to be treated as a fatal
>>>>> error that causes the job to come to a halt. I'd expect the document to
>>>>> be passed over, and the crawl to continue.
>>>>> 
>>>>> Is the only solution at this point to remove the documents that return
>>>>> a 302, so that the crawl can run in full?
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Phil Riethmuller
>>>>> 
>>>>> 
>>>>> From:  Karl Wright <daddywri@gmail.com>
>>>>> Reply-To:  <user@manifoldcf.apache.org>
>>>>> Date:  Wednesday, 17 February 2016 8:58 am
>>>>> 
>>>>> To:  "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>> Subject:  Re: HTTP 302 error causing job to abort
>>>>> 
>>>>> Hi Phil,
>>>>> 
>>>>> You probably want to point your SharePoint repository connection to the
>>>>> proper server and site, and not rely on redirections.  It's also possible
>>>>> that you are missing the site entirely and the redirection you are seeing
>>>>> is taking you to some error page somewhere.
>>>>> 
>>>>> I will be raising the question of redirections with the
>>>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>>>> SharePoint connector code.  However, if your connection is properly set
>>>>> up, redirections should be unneeded.
>>>>> 
>>>>> I would read the documentation on the Wiki page for debugging SharePoint
>>>>> connections at the bottom of this page:
>>>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>>> 
>>>>> Thanks,
>>>>> Karl
>>>>> 
>>>>> 
>>>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller
>>>>> <priethmuller@funnelback.com> wrote:
>>>>>> Do you mean in the job status in the ManifoldCF interface?
>>>>>> 
>>>>>> The job status also shows the same:
>>>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>>>> (302)HTTP/1.0 302 Found
>>>>>> 
>>>>>> I agree, I wouldn't have thought that the crawler would follow any links
>>>>>> or redirections.
>>>>>> 
>>>>>> Which settings could be incorrectly configured, so that I could look at
>>>>>> revising them?
>>>>>> 
>>>>>> Phil
>>>>>> 
>>>>>> 
>>>>>> From:  Karl Wright <daddywri@gmail.com>
>>>>>> Reply-To:  <user@manifoldcf.apache.org>
>>>>>> Date:  Wednesday, 17 February 2016 8:45 am
>>>>>> 
>>>>>> To:  "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>>> Subject:  Re: HTTP 302 error causing job to abort
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> When you view the repository connection in the UI, do you get a 302 error
>>>>>> also?
>>>>>> 
>>>>>> I have looked at the code; HttpClient is supposedly configured to honor
>>>>>> redirections.  Obviously it is not doing that, so I'll have to dig deeper
>>>>>> into why that is.  On the other hand, I would not expect you to be
>>>>>> getting any redirections, unless you have configured your connection
>>>>>> incorrectly.
>>>>>> 
>>>>>> Karl
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller
>>>>>> <priethmuller@funnelback.com> wrote:
>>>>>>> Thanks Karl -
>>>>>>> 
>>>>>>> I've replaced the actual URL with <URL> below, but here is the stack
>>>>>>> trace:
>>>>>>> 
>>>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>>>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>>>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>>>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>> Phil Riethmuller
>>>>>>> 
>>>>>>> 
>>>>>>> From:  Karl Wright <daddywri@gmail.com>
>>>>>>> Reply-To:  <user@manifoldcf.apache.org>
>>>>>>> Date:  Tuesday, 16 February 2016 6:54 pm
>>>>>>> To:  "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>>>>>>> Subject:  Re: HTTP 302 error causing job to abort
>>>>>>> 
>>>>>>> Hi Phil,
>>>>>>> 
>>>>>>> An HTTP 302 response is simply a redirection.  It should not, by itself,
>>>>>>> cause a job to abort.  I would expect that to go by in wire/http
>>>>>>> logging, but you should not see it anywhere else.  So it is not clear to
>>>>>>> me what you are really seeing here.
>>>>>>> 
>>>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>>> 
>>>>>>> Karl
>>>>>>>  
>>>>>>> 
>>>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller
>>>>>>> <priethmuller@funnelback.com> wrote:
>>>>>>> Hi -
>>>>>>> 
>>>>>>> When crawling a SharePoint repository, I'm receiving an HTTP 302 error
>>>>>>> which is causing the Manifold job to abort. How do I prevent the crawler
>>>>>>> from aborting the job?
>>>>>>> 
>>>>>>> I'm using v2.3 of Manifold with a Postgres database.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Phil
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 



