manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Http status code 302
Date Wed, 09 Jan 2013 15:06:25 GMT
I created CONNECTORS-604 to track this problem.

Karl

On Wed, Jan 9, 2013 at 10:02 AM, Karl Wright <daddywri@gmail.com> wrote:
> There seem to be only two differences.  The Host header value is
> different, and there is an Accept header in the one that works.
> (Accept: */*)
>
> I will experiment with curl this evening to see which of these is
> causing the problem.  Or, if you don't want to wait, you can use curl
> and explicitly set these headers to see which one causes it to fail.
>
> Thanks,
> Karl
>
>
> On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe
> <shinichiro.abe.1@gmail.com> wrote:
>> Thank you for your guidance.
>> I got a log from MCF 1.0.1.
>>
>> A) a log from curl
>>
>> curl -vvv "http://lucene.jugem.jp/?eid=39"
>> * About to connect() to lucene.jugem.jp port 80 (#0)
>> *   Trying 210.172.160.170... connected
>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>> > GET /?eid=39 HTTP/1.1
>> > User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
>> > Host: lucene.jugem.jp
>> > Accept: */*
>> >
>> < HTTP/1.1 200 OK
>> < Date: Wed, 09 Jan 2013 13:23:15 GMT
>> < Server: Apache/2.0.59 (Unix)
>> < Vary: User-Agent,Host,Accept-Encoding
>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>> < Accept-Ranges: bytes
>> < Content-Length: 22594
>> < Cache-Control: private
>> < Pragma: no-cache
>> < Connection: close
>> < Content-Type: text/html
>>
>>
>> B) a log from MCF 1.0.1
>>
>> DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 210.172.160.170:80
>> DEBUG 2013-01-09 23:40:11,436 (Thread-472) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: lucene.jugem.jp
>> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "Host: lucene.jugem.jp[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,629 (Thread-472) - << "HTTP/1.1 200 OK[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Date: Wed, 09 Jan 2013 14:39:24 GMT[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Vary: User-Agent,Host,Accept-Encoding[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Accept-Ranges: bytes[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Length: 22594[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Cache-Control: private[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Pragma: no-cache[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Connection: close[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Type: text/html[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "[\r][\n]"
>> DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection in response to directive: close
>>
>> Is this enough to diagnose the problem?
>>
>> Thank you very much,
>> Shinichiro
>>
>>
>>
>>
>> On 2013/01/09, at 23:12, Karl Wright wrote:
>>
>>> Wire debugging with MCF 1.0.1 requires different logging.ini
>>> parameters, because it uses commons-httpclient instead.  That's
>>> described here:
>>>
>>> http://hc.apache.org/httpclient-3.x/logging.html
>>>
>>> I will need a working comparison to diagnose what is happening, so
>>> please either get a log from curl, or better yet from MCF 1.0.1.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe
>>> <shinichiro.abe.1@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I did wire debugging:
>>>> curl yielded a 200, ManifoldCF trunk got a 302, and ManifoldCF 1.0.1 got a 200.
>>>>
>>>> The manifoldcf.log from trunk showed the wire log [1], but the one from 1.0.1 showed nothing.
>>>>
>>>> [1]
>>>> DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1
>>>> DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host: lucene.jugem.jp:80[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection: Keep-Alive[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: shinichiro.abe.1@gmail.com
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
>>>> DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << "HTTP/1.1 302 Found[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Date: Wed, 09 Jan 2013 13:06:39 GMT[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Location: http://error.jugem.jp/[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Length: 285[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Connection: close[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Type: text/html; charset=iso-8859-1[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1 302 Found
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013 13:06:39 GMT
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59 (Unix)
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location: http://error.jugem.jp/
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
>>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
>>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html; charset=iso-8859-1
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<html><head>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<title>302 Found</title>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "</head><body>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<h1>Found</h1>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<p>The document has moved <a href="http://error.jugem.jp/">here</a>.</p>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<hr>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "<address>Apache/2.0.59 (Unix) Server at lucene.jugem.jp Port 80</address>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "</body></html>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection 0.0.0.0:56784<->210.172.160.170:80 closed
>>>>
>>>>
>>>>
>>>> Hmm, it looks like the server is redirecting to the error location anyway.
>>>>
>>>> Thanks,
>>>> Shinichiro Abe
>>>>
>>>>
>>>> On 2013/01/09, at 21:08, Karl Wright wrote:
>>>>
>>>>> Odd that curl would yield a 200 while ManifoldCF gets a 302.  Maybe
>>>>> Koji's blog site does not like one of the headers, crawler-agent
>>>>> perhaps?
>>>>>
>>>>> I am behind a firewall now but I will explore this later today.  In
>>>>> the meantime, if you want to research the problem, could you turn on
>>>>> wire debugging?  You do this in the logging.ini file following these
>>>>> instructions:
>>>>>
>>>>> http://hc.apache.org/httpcomponents-client-ga/logging.html
>>>>>
>>>>> You should see everything happening in the log then, and you can then
>>>>> compare against curl using -vvv.  Please let me know what you find.
>>>>>
>>>>> Thanks!
>>>>> Karl
>>>>>
>>>>> On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
>>>>> <shinichiro.abe.1@gmail.com> wrote:
>>>>>> I'm using the web connector.
>>>>>>
>>>>>>> Are you trying to crawl through a proxy?
>>>>>> No. I just set that URL as a seed, without a proxy.
>>>>>> (Also, I didn't obey robots.txt.)
>>>>>>
>>>>>> Using curl, I get the same result as you.
>>>>>>
>>>>>> Could you reproduce that?
>>>>>>
>>>>>> Shinichiro
>>>>>>
>>>>>> On 2013/01/09, at 17:49, Karl Wright wrote:
>>>>>>
>>>>>>> When I try the URL you gave using curl and no special arguments, I get this:
>>>>>>>
>>>>>>>
>>>>>>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>>>>>>> * About to connect() to lucene.jugem.jp port 80 (#0)
>>>>>>> *   Trying 210.172.160.170... connected
>>>>>>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>>>>>> > GET /?eid=39 HTTP/1.1
>>>>>>> > User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
>>>>>>> > Host: lucene.jugem.jp
>>>>>>> > Accept: */*
>>>>>>> >
>>>>>>> < HTTP/1.1 200 OK
>>>>>>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>>>>>>> < Server: Apache/2.0.59 (Unix)
>>>>>>> < Vary: User-Agent,Host,Accept-Encoding
>>>>>>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>>>>>>> < Accept-Ranges: bytes
>>>>>>> < Content-Length: 22594
>>>>>>> < Cache-Control: private
>>>>>>> < Pragma: no-cache
>>>>>>> < Connection: close
>>>>>>> < Content-Type: text/html
>>>>>>>
>>>>>>> There's no 302 from here.
>>>>>>>
>>>>>>> Are you trying to crawl through a proxy?  If so, that might be where
>>>>>>> the problem lies.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>> It sounds like the httpclient upgrade definitely broke something.  We
>>>>>>>> should open a ticket.
>>>>>>>>
>>>>>>>> But first, can you confirm what connector this is?  Is it the web
>>>>>>>> connector?  If so, I am puzzled because the web connector has always
>>>>>>>> logged any 302 return, but then queued a second document which it
>>>>>>>> subsequently fetches.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>>>>>>> <shinichiro.abe.1@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm using trunk code and crawling a web site with seeds that include http://lucene.jugem.jp/?eid=39 (Koji's blog; I don't obey robots.txt).
>>>>>>>>> When I look at Simple History, it shows a 302 result code for the fetch activity and the document is not ingested.
>>>>>>>>>
>>>>>>>>> When I used MCF 1.0.1 in the same situation, Simple History showed a 200 result code and MCF could ingest the documents.
>>>>>>>>>
>>>>>>>>> Why does trunk show a 302 status? Is it related to the httpclient upgrade?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Shinichiro Abe
>>>>>>
>>>>
>>
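
The header comparison discussed in this thread can also be done mechanically. The following is a minimal Python sketch, assuming wire-log lines of the form shown above (>> "Name: value[\r][\n]") with wrapped lines rejoined. The names request_headers and diff_headers are hypothetical helpers, not part of ManifoldCF or HttpClient. The two samples hold the request headers copied from the MCF 1.0.1 log (200) and the trunk log (302).

```python
import re

# Matches request-side wire lines such as:  >> "Host: lucene.jugem.jp:80[\r][\n]"
# The request line itself ("GET ... HTTP/1.1") has no "Name: " part and is skipped.
WIRE_HEADER = re.compile(r'>> "([A-Za-z-]+): (.*?)\[\\r\]\[\\n\]"')


def request_headers(log_text):
    """Return {header name: value} for the request headers found in a wire log."""
    return {name: value for name, value in WIRE_HEADER.findall(log_text)}


def diff_headers(working, failing):
    """Return {name: (working value, failing value)} for headers that differ.

    A header missing on one side is reported with None on that side.
    """
    names = sorted(set(working) | set(failing))
    return {
        name: (working.get(name), failing.get(name))
        for name in names
        if working.get(name) != failing.get(name)
    }


# Request lines from the MCF 1.0.1 log in this thread (timestamps dropped).
MCF_101 = r'''
>> "GET /?eid=39 HTTP/1.1[\r][\n]"
>> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
>> "From: shinichiro.abe.1@gmail.com[\r][\n]"
>> "Host: lucene.jugem.jp[\r][\n]"
'''

# Request lines from the trunk log in this thread (timestamps dropped).
TRUNK = r'''
>> "GET /?eid=39 HTTP/1.1[\r][\n]"
>> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
>> "From: shinichiro.abe.1@gmail.com[\r][\n]"
>> "Host: lucene.jugem.jp:80[\r][\n]"
>> "Connection: Keep-Alive[\r][\n]"
'''
```

For these two logs, diff_headers(request_headers(MCF_101), request_headers(TRUNK)) reports Host and Connection as the only differing request headers. It does not say which of them triggers the 302.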

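
The suggestion to set headers explicitly and see which one causes the failure can be scripted as well. This is a rough sketch using Python's standard http.client, which does not follow redirects, so a 302 stays visible. The names fetch_status, run_all and VARIANTS are hypothetical, the e-mail address in the User-Agent is a placeholder, and the sketch assumes the server still behaves as it did in January 2013, which may no longer be true.

```python
import http.client

# Placeholder crawler identity; substitute your own contact address.
USER_AGENT = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; you@example.com)"

# Each variant changes one thing relative to the baseline Host header
# that appeared in the requests answered with 200 in this thread.
VARIANTS = {
    "baseline": {"Host": "lucene.jugem.jp"},
    "host+port": {"Host": "lucene.jugem.jp:80"},
    "keep-alive": {"Host": "lucene.jugem.jp", "Connection": "Keep-Alive"},
    "accept": {"Host": "lucene.jugem.jp", "Accept": "*/*"},
}


def fetch_status(server, path, headers, port=80, timeout=10):
    """Send one GET with exactly the given headers and return the status code."""
    conn = http.client.HTTPConnection(server, port, timeout=timeout)
    try:
        # skip_host / skip_accept_encoding stop http.client from adding
        # its own Host and Accept-Encoding headers to the request.
        conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
        conn.putheader("User-Agent", USER_AGENT)
        for name, value in headers.items():
            conn.putheader(name, value)
        conn.endheaders()
        return conn.getresponse().status
    finally:
        conn.close()


def run_all(server="lucene.jugem.jp", path="/?eid=39"):
    """Return {variant label: HTTP status} for every entry in VARIANTS."""
    return {label: fetch_status(server, path, h) for label, h in VARIANTS.items()}
```

Calling run_all() and looking for the variants that return 302 instead of 200 narrows the cause to a single header, which is the same experiment as passing -H options to curl.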