nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Updated: (NUTCH-633) ParseSegment no longer allow reparsing
Date Fri, 19 Sep 2008 13:18:45 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-633:
--------------------------------

    Attachment: NUTCH_633.patch

OK, I shouldn't have missed this one :)

Anyway, I think it is better to modify the fetchers so that they always store FETCH_STATUS_KEY,
instead of modifying the parser.

And, here is a patch which does exactly that :D
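
As a rough illustration of the idea (not the contents of NUTCH_633.patch), the sketch below has the fetcher side record the status unconditionally, so the parser side can keep its plain Integer.parseInt call. A java.util.Map stands in for Nutch's content metadata, and the class name, method names, key string, and status values are all placeholders invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for Nutch's content metadata, and the
// constants below are placeholders, not the real Nutch values.
final class FetchStatusSketch {
    static final String FETCH_STATUS_KEY = "fetch.status"; // placeholder for Nutch.FETCH_STATUS_KEY
    static final int STATUS_FETCH_SUCCESS = 1;              // placeholder for CrawlDatum.STATUS_FETCH_SUCCESS
    static final int STATUS_FETCH_GONE = 2;                 // placeholder for a non-success status

    // Fetcher side: the status is stored whether or not parsing runs in the fetcher.
    static Map<String, String> fetcherOutput(int status, boolean parsingInFetcher) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(FETCH_STATUS_KEY, Integer.toString(status));
        // parsingInFetcher deliberately has no effect on what is stored.
        return metadata;
    }

    // Parser side: the key is always present, so no null check is needed here.
    static boolean shouldParse(Map<String, String> metadata) {
        int status = Integer.parseInt(metadata.get(FETCH_STATUS_KEY));
        return status == STATUS_FETCH_SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println(shouldParse(fetcherOutput(STATUS_FETCH_SUCCESS, true)));
        System.out.println(shouldParse(fetcherOutput(STATUS_FETCH_GONE, false)));
    }
}
```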

> ParseSegment no longer allow reparsing
> --------------------------------------
>
>                 Key: NUTCH-633
>                 URL: https://issues.apache.org/jira/browse/NUTCH-633
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>         Environment: any
>            Reporter: Xue Yong Zhi
>            Priority: Minor
>         Attachments: NUTCH_633.patch
>
>
> ParseSegment used to allow reparsing even if parsing has been enabled in Fetcher. But now it
> throws a NumberFormatException as 'content.getMetadata().get(Nutch.FETCH_STATUS_KEY)' is null.
> This patch will fix the problem:
> --- a/src/java/org/apache/nutch/parse/ParseSegment.java
> +++ b/src/java/org/apache/nutch/parse/ParseSegment.java
> @@ -70,8 +70,10 @@ public class ParseSegment extends Configured implements Tool, Mapper<WritableCom
>        key = newKey;
>      }
>      
> +    //status_key is only available when parsing is not done in fetcher
> +    String status_key = content.getMetadata().get(Nutch.FETCH_STATUS_KEY);
>      int status =
> -      Integer.parseInt(content.getMetadata().get(Nutch.FETCH_STATUS_KEY));
> +      (null == status_key) ? CrawlDatum.STATUS_FETCH_SUCCESS : Integer.parseInt(status_key);
>      if (status != CrawlDatum.STATUS_FETCH_SUCCESS) {
>        // content not fetched successfully, skip document
>        LOG.debug("Skipping " + key + " as content is not fetched successfully");
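
The failure mode and the null-safe fallback from the quoted diff can be reproduced outside Nutch. The sketch below relies only on the standard behavior of Integer.parseInt, which throws NumberFormatException for a null argument. The class name, method names, and the status value are placeholders for this example, not Nutch code.

```java
// Sketch only: isolates the null-handling logic from the quoted diff.
final class ParseStatusSketch {
    static final int STATUS_FETCH_SUCCESS = 1; // placeholder for CrawlDatum.STATUS_FETCH_SUCCESS

    // Mirrors the patched expression: a missing status is treated as success.
    static int statusOrSuccess(String statusKey) {
        return (statusKey == null) ? STATUS_FETCH_SUCCESS : Integer.parseInt(statusKey);
    }

    // Mirrors the unpatched call: parseInt on a null string is rejected.
    static boolean parseIntRejectsNull() {
        try {
            Integer.parseInt(null);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("parseInt(null) throws: " + parseIntRejectsNull());
        System.out.println("fallback status: " + statusOrSuccess(null));
    }
}
```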

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

