nutch-dev mailing list archives

From "Arkadi Kosmynin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1993) Nutch does not use backup parsers
Date Tue, 21 Apr 2015 06:01:07 GMT
Arkadi Kosmynin created NUTCH-1993:
--------------------------------------

             Summary: Nutch does not use backup parsers
                 Key: NUTCH-1993
                 URL: https://issues.apache.org/jira/browse/NUTCH-1993
             Project: Nutch
          Issue Type: Bug
          Components: parser
            Reporter: Arkadi Kosmynin


From reading the code, it is clear that it is designed to try several parsers on a
document in sequence until one of them parses it successfully. In practice, this does not
work because these lines

if (parseResult != null && !parseResult.isEmpty())
        return parseResult;

exit the loop even when parsing has failed: parseResult is not empty in that case, because
it still contains a ParseData with ParseStatus.FAILED.

A fix:

if (parseResult != null && parseResult.isAnySuccess())
        return parseResult;

Here parseResult.isAnySuccess() returns true if at least one of the parsing attempts was successful.
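
To illustrate the difference between the two checks, here is a minimal, self-contained
sketch. It does not use the real Nutch classes: Status, Result, Parser and both loop
methods are hypothetical stand-ins written for this example, and only the shape of the
condition is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BackupParserSketch {

    // Hypothetical stand-in for a parse outcome; not Nutch's ParseStatus.
    enum Status { SUCCESS, FAILED }

    // Hypothetical stand-in for a parse result holding one status per entry.
    static final class Result {
        final String parserName;
        final List<Status> statuses;

        Result(String parserName, Status... outcomes) {
            this.parserName = parserName;
            this.statuses = new ArrayList<>(Arrays.asList(outcomes));
        }

        boolean isEmpty() { return statuses.isEmpty(); }

        // True if at least one entry was parsed successfully.
        boolean isAnySuccess() { return statuses.contains(Status.SUCCESS); }
    }

    // Hypothetical parser interface for the sketch.
    interface Parser { Result parse(String content); }

    // Reported behaviour: the first non-empty result ends the loop,
    // even if that result only records a failure.
    static Result parseFirstNonEmpty(List<Parser> parsers, String content) {
        Result last = null;
        for (Parser p : parsers) {
            Result r = p.parse(content);
            if (r != null && !r.isEmpty()) return r;
            if (r != null) last = r;
        }
        return last;
    }

    // Proposed behaviour: keep trying parsers until one reports success.
    // If none succeeds, the last non-null result is returned so the caller
    // can still inspect the failure.
    static Result parseUntilSuccess(List<Parser> parsers, String content) {
        Result last = null;
        for (Parser p : parsers) {
            Result r = p.parse(content);
            if (r != null && r.isAnySuccess()) return r;
            if (r != null) last = r;
        }
        return last;
    }

    public static void main(String[] args) {
        List<Parser> parsers = new ArrayList<>();
        parsers.add(c -> new Result("primary", Status.FAILED));  // primary parser fails
        parsers.add(c -> new Result("backup", Status.SUCCESS));  // backup would succeed

        System.out.println(parseFirstNonEmpty(parsers, "doc").parserName); // primary
        System.out.println(parseUntilSuccess(parsers, "doc").parserName);  // backup
    }
}
```

In this sketch the first loop never reaches the backup parser, because the failed result
from the primary parser is already non-empty. The second loop moves on to the backup
parser and returns its result.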

This fix is important because it allows backup parsers to be used as originally designed
and thus increases index completeness.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
