manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Http status code 302
Date Wed, 09 Jan 2013 12:08:56 GMT
Odd that curl would yield a 200 while ManifoldCF gets a 302.  Maybe
Koji's blog site does not like one of the headers, crawler-agent
perhaps?

I am behind a firewall now but I will explore this later today.  In
the meantime, if you want to research the problem, could you turn on
wire debugging?  You do this in the logging.ini file following these
instructions:

http://hc.apache.org/httpcomponents-client-ga/logging.html

You should see everything happening in the log then, and you can then
compare against curl using -vvv.  Please let me know what you find.
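
One way to test the header theory outside the crawler is to send the same GET with different User-Agent values and look at the raw status, without following redirects. The following is only a sketch under assumptions: the `fetch_status` helper is a made-up name for illustration, it uses Python's standard `http.client` rather than anything from ManifoldCF, and the User-Agent that ManifoldCF actually sends must be read from the wire log.

```python
import http.client


def fetch_status(host, path, user_agent, port=80, timeout=10):
    """Send one GET and return (status, Location header).

    http.client never follows redirects, so a 302 is reported as-is.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # Only the User-Agent varies between calls; Accept mirrors curl's default.
        conn.request("GET", path, headers={"User-Agent": user_agent, "Accept": "*/*"})
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `fetch_status("lucene.jugem.jp", "/?eid=39", agent)` once with curl's agent string and once with the agent string taken from the wire log would show whether that header alone changes the response.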

Thanks!
Karl

On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
<shinichiro.abe.1@gmail.com> wrote:
> I'm using the web connector.
>
>> Are you trying to crawl through a proxy?
> No. I just set that URL as a seed, without a proxy.
> (Also, I didn't obey robots.txt.)
>
> With curl, I get the same result as you.
>
> Could you reproduce that?
>
> Shinichiro
>
> On 2013/01/09, at 17:49, Karl Wright wrote:
>
>> When I try the URL you gave using curl and no special arguments, I get this:
>>
>>
>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>> * About to connect() to lucene.jugem.jp port 80 (#0)
>> *   Trying 210.172.160.170... connected
>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>> > GET /?eid=39 HTTP/1.1
>> > User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
>> > Host: lucene.jugem.jp
>> > Accept: */*
>> >
>> < HTTP/1.1 200 OK
>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>> < Server: Apache/2.0.59 (Unix)
>> < Vary: User-Agent,Host,Accept-Encoding
>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>> < Accept-Ranges: bytes
>> < Content-Length: 22594
>> < Cache-Control: private
>> < Pragma: no-cache
>> < Connection: close
>> < Content-Type: text/html
>>
>> There's no 302 from here.
>>
>> Are you trying to crawl through a proxy?  If so, that might be where
>> the problem lies.
>>
>> Karl
>>
>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright <daddywri@gmail.com> wrote:
>>> It sounds like the httpclient upgrade definitely broke something.  We
>>> should open a ticket.
>>>
>>> But first, can you confirm what connector this is?  Is it the web
>>> connector?  If so, I am puzzled because the web connector has always
>>> logged any 302 return, but then queued a second document which it
>>> subsequently fetches.
>>>
>>> Karl
>>>
>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>> <shinichiro.abe.1@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I'm using trunk code and crawling a web site with a seed of http://lucene.jugem.jp/?eid=39
>>>> (Koji's blog; I don't obey robots.txt).
>>>> Looking at Simple History, it shows a 302 result code for the fetch activity
>>>> and the document is not ingested.
>>>>
>>>> When I used MCF 1.0.1 in the same situation, Simple History showed a 200 result
>>>> code and MCF could ingest documents.
>>>>
>>>> Why does trunk show a 302 status? Is it related to the httpclient upgrade?
>>>>
>>>> Thanks in advance,
>>>> Shinichiro Abe
>
