nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1919) Getting timeout when server returns Content-Length: 0
Date Thu, 15 Jan 2015 11:25:34 GMT
Julien Nioche created NUTCH-1919:
------------------------------------

             Summary: Getting timeout when server returns Content-Length: 0 
                 Key: NUTCH-1919
                 URL: https://issues.apache.org/jira/browse/NUTCH-1919
             Project: Nutch
          Issue Type: Bug
          Components: protocol
            Reporter: Julien Nioche
             Fix For: 1.10


This has been investigated and fixed in Storm-Crawler [https://github.com/DigitalPebble/storm-crawler/issues/48].

{quote}
curl -I "http://www.dailynewslosangeles.com/"
HTTP/1.1 301 Moved Permanently
Location: http://www.dailynews.com
Connection: close
Content-Length: 0
Content-Type: text/html; charset=UTF-8
{quote}

When fetching the same URL with Nutch, we get a timeout exception:

{quote}
./nutch parsechecker -D http.agent.name="PebbleCrawler" "http://www.dailynewslosangeles.com/"
fetching: http://www.dailynewslosangeles.com/
Fetch failed with protocol status: exception(16), lastModified=0: java.net.SocketTimeoutException: Read timed out
{quote}

The cause is that we try to read from the response stream even though the Content-Length
header already tells us the body is empty, so the read waits for bytes that never arrive
and eventually times out.

The patch attached fixes the issue. 
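
For illustration only, the idea can be sketched as below. This is not the attached patch and not Nutch's actual code: the class name ContentLengthGuard and the helpers readBody and parseContentLength are hypothetical, and the sketch assumes the Content-Length header value is available as a string before the body is read. When the declared length is 0, the stream is never touched; otherwise at most the declared number of bytes is read, or everything up to end of stream if the length is unknown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: shows the "do not read when
// Content-Length is 0" guard in isolation.
class ContentLengthGuard {

    /**
     * Reads a response body. If the declared length is 0 the stream is
     * not read at all, so a socket with nothing to send cannot block us.
     */
    static byte[] readBody(InputStream in, String contentLengthHeader) throws IOException {
        long declared = parseContentLength(contentLengthHeader);
        if (declared == 0) {
            return new byte[0]; // empty body: skip the read entirely
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        long remaining = declared; // -1 means unknown: read until end of stream
        while (remaining != 0) {
            int want = remaining < 0 ? buf.length : (int) Math.min(buf.length, remaining);
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // end of stream before the declared length was reached
            }
            out.write(buf, 0, n);
            if (remaining > 0) {
                remaining -= n;
            }
        }
        return out.toByteArray();
    }

    /** Returns the declared length, or -1 if the header is missing or invalid. */
    static long parseContentLength(String header) {
        if (header == null) {
            return -1;
        }
        try {
            long value = Long.parseLong(header.trim());
            return value < 0 ? -1 : value;
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

With this guard, a response such as the 301 above would yield an empty body immediately rather than waiting for the read timeout.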






