nutch-dev mailing list archives

From "Stefan Groschupf (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host
Date Sat, 03 Jun 2006 19:44:29 GMT
if a 404 for a robots.txt is returned no page is fetched at all from the host
-----------------------------------------------------------------------------

         Key: NUTCH-298
         URL: http://issues.apache.org/jira/browse/NUTCH-298
     Project: Nutch
        Type: Bug

    Reporter: Stefan Groschupf
     Fix For: 0.8-dev


What happens:

If no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt.
If the HTTP response code is neither 200 nor 403 but, for example, 404, we do
"robotRules = EMPTY_RULES;" (line: 402).
EMPTY_RULES is a RobotRuleSet created with the default constructor, so its
tmpEntries and entries fields are null and are never changed.
If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed
on the RobotRuleSet.
In this case an NPE is thrown in this line, because tmpEntries is still null:
 if (entries == null) {
        entries= new RobotsEntry[tmpEntries.size()];
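
The failure can be reproduced in isolation with a minimal sketch. LazyRuleSet
below is a hypothetical, stripped-down stand-in for RobotRuleSet, not the actual
Nutch class: it assumes that rules are plain disallowed path prefixes and that
tmpEntries is only created when the first rule is added. The names addPrefix,
LazyRuleSet and NpeDemo are invented for illustration.

```java
import java.util.ArrayList;

// Demo driver: builds a rule set with no rules, as EMPTY_RULES would be.
class NpeDemo {
    static boolean throwsNpe() {
        LazyRuleSet emptyRules = new LazyRuleSet(); // default constructor only
        try {
            emptyRules.isAllowed("/index.html");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE on empty rule set: " + throwsNpe());
    }
}

// Hypothetical stand-in for RobotRuleSet (assumption: prefix-only rules).
class LazyRuleSet {
    ArrayList<String> tmpEntries; // stays null until a rule is added
    String[] entries;             // built lazily from tmpEntries

    void addPrefix(String prefix) {
        if (tmpEntries == null) {
            tmpEntries = new ArrayList<>();
        }
        tmpEntries.add(prefix);
        entries = null; // force a rebuild on the next lookup
    }

    boolean isAllowed(String path) {
        if (entries == null) {
            // Same pattern as the reported line: tmpEntries is dereferenced
            // although nothing guarantees it was ever initialised.
            entries = new String[tmpEntries.size()];
            entries = tmpEntries.toArray(entries);
        }
        for (String prefix : entries) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```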

Possible solution:
We can initialize tmpEntries by default and also remove the other null checks and initialisations.
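
A rough sketch of that idea, under the same assumptions as the description
above: SafeRuleSet is a hypothetical simplified stand-in for RobotRuleSet (rules
are plain disallowed path prefixes; addPrefix, SafeRuleSet and FixDemo are
invented names), not the actual patch. tmpEntries is initialised eagerly, so an
instance built with the default constructor, such as EMPTY_RULES, simply allows
everything.

```java
import java.util.ArrayList;
import java.util.List;

// Demo driver for the hypothetical SafeRuleSet below.
class FixDemo {
    public static void main(String[] args) {
        SafeRuleSet emptyRules = new SafeRuleSet(); // plays the role of EMPTY_RULES
        System.out.println(emptyRules.isAllowed("/index.html"));

        SafeRuleSet rules = new SafeRuleSet();
        rules.addPrefix("/private/");
        System.out.println(rules.isAllowed("/private/a.html"));
    }
}

// Hypothetical stand-in for RobotRuleSet (assumption: prefix-only rules).
class SafeRuleSet {
    // Initialised by default, so it can never be null.
    private final List<String> tmpEntries = new ArrayList<>();
    private String[] entries; // cached array view of tmpEntries

    void addPrefix(String prefix) {
        tmpEntries.add(prefix); // no null check needed any more
        entries = null;         // invalidate the cached array
    }

    boolean isAllowed(String path) {
        if (entries == null) {
            // Safe even when no rule was added: yields an empty array.
            entries = tmpEntries.toArray(new String[0]);
        }
        for (String prefix : entries) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The null check on entries is kept in this sketch only as a cache rebuild
trigger; the checks on tmpEntries are the ones that become unnecessary.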


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

