manifoldcf-user mailing list archives

From Shinichiro Abe <shinichiro.ab...@gmail.com>
Subject Re: Http status code 302
Date Wed, 09 Jan 2013 14:56:09 GMT
Thank you for the guidance.
I got a log from MCF 1.0.1.
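
For reference, here is a minimal sketch of the kind of logging.ini entries that turn on this wire logging for commons-httpclient 3.x. The logger names are the ones listed in the httpclient 3.x logging guide; that logging.ini accepts log4j properties syntax is an assumption here, so adjust to match your installation.

```
# Assumption: logging.ini is read as a log4j properties file.
# Logs request/response header lines as they appear on the wire.
log4j.logger.httpclient.wire.header=DEBUG
# Logs connection-level events from commons-httpclient itself.
log4j.logger.org.apache.commons.httpclient=DEBUG
```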

A) a log from curl

curl -vvv "http://lucene.jugem.jp/?eid=39"
* About to connect() to lucene.jugem.jp port 80 (#0)
*   Trying 210.172.160.170... connected
* Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
> GET /?eid=39 HTTP/1.1
> User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
> Host: lucene.jugem.jp
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 09 Jan 2013 13:23:15 GMT
< Server: Apache/2.0.59 (Unix)
< Vary: User-Agent,Host,Accept-Encoding
< Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
< Accept-Ranges: bytes
< Content-Length: 22594
< Cache-Control: private
< Pragma: no-cache
< Connection: close
< Content-Type: text/html


B) a log from MCF 1.0.1

DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 210.172.160.170:80
DEBUG 2013-01-09 23:40:11,436 (Thread-472) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: lucene.jugem.jp
DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "Host: lucene.jugem.jp[\r][\n]"
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "[\r][\n]"
DEBUG 2013-01-09 23:40:11,629 (Thread-472) - << "HTTP/1.1 200 OK[\r][\n]"
DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Date: Wed, 09 Jan 2013 14:39:24 GMT[\r][\n]"
DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Vary: User-Agent,Host,Accept-Encoding[\r][\n]"
DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT[\r][\n]"
DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Accept-Ranges: bytes[\r][\n]"
DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Length: 22594[\r][\n]"
DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Cache-Control: private[\r][\n]"
DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Pragma: no-cache[\r][\n]"
DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Connection: close[\r][\n]"
DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Type: text/html[\r][\n]"
DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "[\r][\n]"
DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection in response to directive: close

Is this enough to diagnose the problem?
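
To make the comparison easier, here is a small Python sketch that extracts the request headers from two wire logs and reports the ones that differ. The helper names (`parse_headers`, `diff_headers`) and the regular expression are illustrative, not part of ManifoldCF or httpclient; the sample lines are copied from the 1.0.1 log above and the trunk log quoted below. It only shows where the requests differ and does not establish which difference, if any, causes the 302.

```python
import re

# Wire-log line as printed in these logs, e.g.
#   DEBUG ... (Thread-472) - >> "Host: lucene.jugem.jp[\r][\n]"
# Group 1 is the direction (>> sent, << received); group 2 is the payload
# with the trailing [\r][\n] markers stripped.
WIRE_LINE = re.compile(r'- (>>|<<) "(.*?)(?:\[\\r\])?(?:\[\\n\])?"\s*$')


def parse_headers(log_text, direction=">>"):
    """Return {lowercased header name: value} for one direction of a wire log."""
    headers = {}
    for line in log_text.splitlines():
        match = WIRE_LINE.search(line)
        if not match or match.group(1) != direction:
            continue
        name, sep, value = match.group(2).partition(": ")
        if sep:  # skips the request line and the empty terminator line
            headers[name.lower()] = value
    return headers


def diff_headers(a, b):
    """Return {name: (value_in_a, value_in_b)} for headers that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Request lines from the MCF 1.0.1 log (200 OK).
MCF101_LOG = r"""
DEBUG 2013-01-09 23:40:11,436 (Thread-472) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "Host: lucene.jugem.jp[\r][\n]"
DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "[\r][\n]"
"""

# Request lines from the trunk log (302 Found).
TRUNK_LOG = r"""
DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host: lucene.jugem.jp:80[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection: Keep-Alive[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
"""

diff = diff_headers(parse_headers(MCF101_LOG), parse_headers(TRUNK_LOG))
for name, (old, new) in diff.items():
    print(f"{name}: 1.0.1={old!r} trunk={new!r}")
```

On these samples the script reports two differences: trunk sends `Host: lucene.jugem.jp:80` where 1.0.1 sends `Host: lucene.jugem.jp`, and trunk adds `Connection: Keep-Alive`. The User-Agent and From headers are identical.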

Thank you very much,
Shinichiro




On 2013/01/09, at 23:12, Karl Wright wrote:

> Wire debugging with MCF 1.0.1 requires different logging.ini
> parameters, because it uses commons-httpclient instead.  That's
> described here:
> 
> http://hc.apache.org/httpclient-3.x/logging.html
> 
> I will need a working comparison to diagnose what is happening, so
> please either get a log from curl, or better yet from MCF 1.0.1.
> 
> Thanks!
> Karl
> 
> 
> On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe
> <shinichiro.abe.1@gmail.com> wrote:
>> Hi,
>> 
>> I did wire debugging:
>> curl yielded a 200 and ManifoldCF 1.0.1 got a 200, while ManifoldCF trunk got a 302.
>> 
>> The manifoldcf.log from trunk showed the wire log below [1], but the one from 1.0.1 showed no wire log.
>> 
>> [1]
>> DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1
>> DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host: lucene.jugem.jp:80[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection: Keep-Alive[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: shinichiro.abe.1@gmail.com
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
>> DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << "HTTP/1.1 302 Found[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Date: Wed, 09 Jan 2013 13:06:39 GMT[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Location: http://error.jugem.jp/[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Length: 285[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Connection: close[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Type: text/html; charset=iso-8859-1[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "[\r][\n]"
>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1 302 Found
>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013 13:06:39 GMT
>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59 (Unix)
>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location: http://error.jugem.jp/
>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html; charset=iso-8859-1
>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<html><head>[\n]"
>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<title>302 Found</title>[\n]"
>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "</head><body>[\n]"
>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<h1>Found</h1>[\n]"
>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<p>The document has moved <a href="http://error.jugem.jp/">here</a>.</p>[\n]"
>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<hr>[\n]"
>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "<address>Apache/2.0.59 (Unix) Server at lucene.jugem.jp Port 80</address>[\n]"
>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "</body></html>[\n]"
>> DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection 0.0.0.0:56784<->210.172.160.170:80 closed
>> 
>> 
>> 
>> Hmm. It looks like the server is redirecting to the error location anyway.
>> 
>> Thanks,
>> Shinichiro Abe
>> 
>> 
>> On 2013/01/09, at 21:08, Karl Wright wrote:
>> 
>>> Odd that curl would yield a 200 while ManifoldCF gets a 302.  Maybe
>>> Koji's blog site does not like one of the headers, crawler-agent
>>> perhaps?
>>> 
>>> I am behind a firewall now but I will explore this later today.  In
>>> the meantime, if you want to research the problem, could you turn on
>>> wire debugging?  You do this in the logging.ini file following these
>>> instructions:
>>> 
>>> http://hc.apache.org/httpcomponents-client-ga/logging.html
>>> 
>>> You should see everything happening in the log then, and you can then
>>> compare against curl using -vvv.  Please let me know what you find.
>>> 
>>> Thanks!
>>> Karl
>>> 
>>> On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
>>> <shinichiro.abe.1@gmail.com> wrote:
>>>> I'm using the web connector.
>>>> 
>>>>> Are you trying to crawl through a proxy?
>>>> No. I just set that URL as a seed, without a proxy.
>>>> (Also, I didn't obey robots.txt.)
>>>> 
>>>> Using curl, I get the same result as you.
>>>> 
>>>> Could you reproduce that?
>>>> 
>>>> Shinichiro
>>>> 
>>>> On 2013/01/09, at 17:49, Karl Wright wrote:
>>>> 
>>>>> When I try the URL you gave using curl and no special arguments, I get this:
>>>>> 
>>>>> 
>>>>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>>>>> * About to connect() to lucene.jugem.jp port 80 (#0)
>>>>> *   Trying 210.172.160.170... connected
>>>>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>>>>> GET /?eid=39 HTTP/1.1
>>>>>> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
>>>>>> Host: lucene.jugem.jp
>>>>>> Accept: */*
>>>>>> 
>>>>> < HTTP/1.1 200 OK
>>>>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>>>>> < Server: Apache/2.0.59 (Unix)
>>>>> < Vary: User-Agent,Host,Accept-Encoding
>>>>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>>>>> < Accept-Ranges: bytes
>>>>> < Content-Length: 22594
>>>>> < Cache-Control: private
>>>>> < Pragma: no-cache
>>>>> < Connection: close
>>>>> < Content-Type: text/html
>>>>> 
>>>>> There's no 302 from here.
>>>>> 
>>>>> Are you trying to crawl through a proxy?  If so, that might be where
>>>>> the problem lies.
>>>>> 
>>>>> Karl
>>>>> 
>>>>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> It sounds like the httpclient upgrade definitely broke something.  We
>>>>>> should open a ticket.
>>>>>> 
>>>>>> But first, can you confirm what connector this is?  Is it the web
>>>>>> connector?  If so, I am puzzled because the web connector has always
>>>>>> logged any 302 return, but then queued a second document which it
>>>>>> subsequently fetches.
>>>>>> 
>>>>>> Karl
>>>>>> 
>>>>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>>>>> <shinichiro.abe.1@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'm using trunk code and crawling a web site with seeds that include http://lucene.jugem.jp/?eid=39 (Koji's blog; I don't obey robots.txt).
>>>>>>> When I look at Simple History, it shows a 302 result code for the fetch activity, and the document is not ingested.
>>>>>>> 
>>>>>>> When I used MCF 1.0.1 in the same situation, Simple History showed a 200 result code and MCF could ingest documents.
>>>>>>> 
>>>>>>> Why does trunk show a 302 status? Is it related to the httpclient upgrade?
>>>>>>> 
>>>>>>> Thanks in advance,
>>>>>>> Shinichiro Abe
>>>> 
>> 

