manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Http status code 302
Date Wed, 09 Jan 2013 15:02:03 GMT
There seem to be only two differences.  The Host header value is
different, and there is an Accept header in the one that works.
(Accept: */*)

I will experiment with curl this evening to see which of these is
causing the problem.  Or, if you don't want to wait, you can use curl
and explicitly set these headers to see which one causes it to fail.

Thanks,
Karl
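
The header experiment suggested above can also be scripted. The following is a minimal sketch, not anything from ManifoldCF itself: it assumes Python's standard http.client module, and the names header_variants and fetch_status, as well as the placeholder address in the User-Agent, are made up for illustration. It varies only the two headers named above (Host with or without ":80", Accept present or absent) and leaves out the other headers seen in the logs (From, Connection).

```python
import http.client
from itertools import product

# Host and path are taken from the logs quoted in this thread.
SERVER = "lucene.jugem.jp"
PATH = "/?eid=39"
# Placeholder contact address (assumption); substitute your own.
USER_AGENT = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; crawler@example.com)"


def header_variants():
    """Return every combination of the two suspect headers."""
    hosts = [SERVER, SERVER + ":80"]  # without and with an explicit port
    accepts = [None, "*/*"]           # Accept header absent or present
    variants = []
    for host, accept in product(hosts, accepts):
        headers = [("User-Agent", USER_AGENT), ("Host", host)]
        if accept is not None:
            headers.append(("Accept", accept))
        variants.append(headers)
    return variants


def fetch_status(headers, timeout=10):
    """Send one GET carrying exactly the given headers; return the status code."""
    conn = http.client.HTTPConnection(SERVER, 80, timeout=timeout)
    try:
        # skip_host and skip_accept_encoding keep http.client from adding
        # its own Host and Accept-Encoding headers to the request.
        conn.putrequest("GET", PATH, skip_host=True, skip_accept_encoding=True)
        for name, value in headers:
            conn.putheader(name, value)
        conn.endheaders()
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling fetch_status(h) for each h in header_variants() and printing the result next to the headers should show which combination, if any, turns the 200 into a 302. The server's behaviour may have changed since this thread was written, so no expected output is given here.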


On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe
<shinichiro.abe.1@gmail.com> wrote:
> Thank you for your guidance.
> I got a log from MCF 1.0.1.
>
> A) a log from curl
>
> curl -vvv "http://lucene.jugem.jp/?eid=39"
> * About to connect() to lucene.jugem.jp port 80 (#0)
> *   Trying 210.172.160.170... connected
> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>> GET /?eid=39 HTTP/1.1
>> User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
>> Host: lucene.jugem.jp
>> Accept: */*
>>
> < HTTP/1.1 200 OK
> < Date: Wed, 09 Jan 2013 13:23:15 GMT
> < Server: Apache/2.0.59 (Unix)
> < Vary: User-Agent,Host,Accept-Encoding
> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
> < Accept-Ranges: bytes
> < Content-Length: 22594
> < Cache-Control: private
> < Pragma: no-cache
> < Connection: close
> < Content-Type: text/html
>
>
> B) a log from MCF 1.0.1
>
> DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 210.172.160.170:80
> DEBUG 2013-01-09 23:40:11,436 (Thread-472) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: lucene.jugem.jp
> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "Host: lucene.jugem.jp[\r][\n]"
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "[\r][\n]"
> DEBUG 2013-01-09 23:40:11,629 (Thread-472) - << "HTTP/1.1 200 OK[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Date: Wed, 09 Jan 2013 14:39:24 GMT[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Vary: User-Agent,Host,Accept-Encoding[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Accept-Ranges: bytes[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Length: 22594[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Cache-Control: private[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Pragma: no-cache[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Connection: close[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Type: text/html[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "[\r][\n]"
> DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection in response to directive: close
>
> Is this enough to diagnose the problem?
>
> Thank you very much,
> Shinichiro
>
>
>
>
> On 2013/01/09, at 23:12, Karl Wright wrote:
>
>> Wire debugging with MCF 1.0.1 requires different logging.ini
>> parameters, because it uses commons-httpclient instead.  That's
>> described here:
>>
>> http://hc.apache.org/httpclient-3.x/logging.html
>>
>> I will need a working comparison to diagnose what is happening, so
>> please either get a log from curl, or better yet from MCF 1.0.1.
>>
>> Thanks!
>> Karl
>>
>>
>> On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe
>> <shinichiro.abe.1@gmail.com> wrote:
>>> Hi,
>>>
>>> I did wire debugging:
>>> curl yielded a 200 and ManifoldCF 1.0.1 got a 200, while ManifoldCF trunk got a 302.
>>>
>>> The manifoldcf.log from trunk showed the wire log [1], but the one from 1.0.1 showed nothing.
>>>
>>> [1]
>>> DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1
>>> DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host: lucene.jugem.jp:80[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection: Keep-Alive[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: shinichiro.abe.1@gmail.com
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
>>> DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << "HTTP/1.1 302 Found[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Date: Wed, 09 Jan 2013 13:06:39 GMT[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Location: http://error.jugem.jp/[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Length: 285[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Connection: close[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Type: text/html; charset=iso-8859-1[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1 302 Found
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013 13:06:39 GMT
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59 (Unix)
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location: http://error.jugem.jp/
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html; charset=iso-8859-1
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<html><head>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<title>302 Found</title>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "</head><body>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<h1>Found</h1>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<p>The document has moved <a href="http://error.jugem.jp/">here</a>.</p>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<hr>[\n]"
>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "<address>Apache/2.0.59 (Unix) Server at lucene.jugem.jp Port 80</address>[\n]"
>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "</body></html>[\n]"
>>> DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection 0.0.0.0:56784<->210.172.160.170:80 closed
>>>
>>>
>>>
>>> Hmm. It looks like the server is redirecting to the error location anyway.
>>>
>>> Thanks,
>>> Shinichiro Abe
>>>
>>>
>>> On 2013/01/09, at 21:08, Karl Wright wrote:
>>>
>>>> Odd that curl would yield a 200 while ManifoldCF gets a 302.  Maybe
>>>> Koji's blog site does not like one of the headers, crawler-agent
>>>> perhaps?
>>>>
>>>> I am behind a firewall now but I will explore this later today.  In
>>>> the meantime, if you want to research the problem, could you turn on
>>>> wire debugging?  You do this in the logging.ini file following these
>>>> instructions:
>>>>
>>>> http://hc.apache.org/httpcomponents-client-ga/logging.html
>>>>
>>>> You should see everything happening in the log then, and you can then
>>>> compare against curl using -vvv.  Please let me know what you find.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>> On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
>>>> <shinichiro.abe.1@gmail.com> wrote:
>>>>> I'm using web connector.
>>>>>
>>>>>> Are you trying to crawl through a proxy?
>>>>> No. I just set that URL as a seed, without a proxy.
>>>>> (Also, I didn't obey robots.txt.)
>>>>>
>>>>> Using curl, I get the same result as you.
>>>>>
>>>>> Could you reproduce that?
>>>>>
>>>>> Shinichiro
>>>>>
>>>>> On 2013/01/09, at 17:49, Karl Wright wrote:
>>>>>
>>>>>> When I try the URL you gave using curl and no special arguments, I get this:
>>>>>>
>>>>>>
>>>>>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>>>>>> * About to connect() to lucene.jugem.jp port 80 (#0)
>>>>>> *   Trying 210.172.160.170... connected
>>>>>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>>>>>> GET /?eid=39 HTTP/1.1
>>>>>>> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
>>>>>>> Host: lucene.jugem.jp
>>>>>>> Accept: */*
>>>>>>>
>>>>>> < HTTP/1.1 200 OK
>>>>>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>>>>>> < Server: Apache/2.0.59 (Unix)
>>>>>> < Vary: User-Agent,Host,Accept-Encoding
>>>>>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>>>>>> < Accept-Ranges: bytes
>>>>>> < Content-Length: 22594
>>>>>> < Cache-Control: private
>>>>>> < Pragma: no-cache
>>>>>> < Connection: close
>>>>>> < Content-Type: text/html
>>>>>>
>>>>>> There's no 302 from here.
>>>>>>
>>>>>> Are you trying to crawl through a proxy?  If so, that might be where
>>>>>> the problem lies.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>> It sounds like the httpclient upgrade definitely broke something.  We
>>>>>>> should open a ticket.
>>>>>>>
>>>>>>> But first, can you confirm what connector this is?  Is it the web
>>>>>>> connector?  If so, I am puzzled because the web connector has always
>>>>>>> logged any 302 return, but then queued a second document which it
>>>>>>> subsequently fetches.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>>>>>> <shinichiro.abe.1@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm using trunk code and crawling a web site with seeds that include http://lucene.jugem.jp/?eid=39 (Koji's blog; I don't obey robots.txt).
>>>>>>>> When I look at Simple History, it shows a 302 result code for the fetch activity and the document is not ingested.
>>>>>>>>
>>>>>>>> When I used MCF 1.0.1 in the same situation, Simple History showed a 200 result code and MCF could ingest the documents.
>>>>>>>>
>>>>>>>> Why does trunk show a 302 status? Is it related to the httpclient upgrade?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Shinichiro Abe
>>>>>
>>>
>
