nutch-dev mailing list archives

From "Emmanuel Joke (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
Date Mon, 11 Feb 2008 16:36:11 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567693#action_12567693 ]

Emmanuel Joke commented on NUTCH-596:
-------------------------------------

I didn't find any useful information in the Content object that tells whether the crawl was
successful.

So I guess that suggestion (reading the status code from the Content metadata) can be eliminated.

I thought of another way to do that: we can create a simple Map/Reduce task that loads CrawlDatum
+ Content and keeps only the content that has a DbStatus == Success. The output of this task would
then be used by the existing ParseSegment task. This solution avoids parsing any content that
could cause errors in the parsing.
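
A minimal sketch of that filtering step, in plain Java rather than against the real Hadoop/Nutch API. Everything in it is an assumption made for illustration: the FetchSuccessFilter class, the Record type and the FetchStatus enum are hypothetical stand-ins for CrawlDatum and Content, and the grouping by URL only imitates what the shuffle phase of a Map/Reduce job would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed pre-parse job; not Nutch code.
class FetchSuccessFilter {

    // Simplified placeholder for the CrawlDatum fetch status (assumption).
    enum FetchStatus { SUCCESS, GONE, RETRY }

    // One record keyed by URL: either a CrawlDatum-like status or a Content-like body.
    static final class Record {
        final String url;
        final FetchStatus status; // non-null for a status record
        final String content;     // non-null for a content record

        private Record(String url, FetchStatus status, String content) {
            this.url = url;
            this.status = status;
            this.content = content;
        }

        static Record datum(String url, FetchStatus status) {
            return new Record(url, status, null);
        }

        static Record content(String url, String body) {
            return new Record(url, null, body);
        }
    }

    // Returns URL -> content for the URLs whose status record says SUCCESS.
    static Map<String, String> filter(List<Record> records) {
        // "Shuffle": group status and content records by URL.
        Map<String, List<Record>> byUrl = new LinkedHashMap<>();
        for (Record r : records) {
            byUrl.computeIfAbsent(r.url, k -> new ArrayList<>()).add(r);
        }

        // "Reduce": emit the content only when a SUCCESS status was seen for the URL.
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Record>> e : byUrl.entrySet()) {
            boolean success = false;
            String content = null;
            for (Record r : e.getValue()) {
                if (r.status == FetchStatus.SUCCESS) {
                    success = true;
                }
                if (r.content != null) {
                    content = r.content;
                }
            }
            if (success && content != null) {
                out.put(e.getKey(), content);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Record> records = Arrays.asList(
            Record.datum("http://a.example/", FetchStatus.SUCCESS),
            Record.content("http://a.example/", "<html>ok</html>"),
            Record.datum("http://b.example/", FetchStatus.GONE),
            Record.content("http://b.example/", "<html>error page</html>"));
        // Only the URLs that would be handed on to the parser are printed.
        System.out.println(filter(records).keySet());
    }
}
```

In a real job the two inputs would presumably be the segment's fetch status and content directories, and the output would take the place of the content that ParseSegment reads today.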

Any thoughts?

> ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-596
>                 URL: https://issues.apache.org/jira/browse/NUTCH-596
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Emmanuel Joke
>
> We have 2 choices for parsing the content: either within the Fetcher class or with the ParseSegment class.
> Fetcher (1 or 2) first checks whether the CrawlDatum status == STATUS_FETCH_SUCCESS, and only if it is true does it parse the content.
> However, we don't have this check in ParseSegment, so we parse all content stored on the disk without checking the status.
> So I think we should implement this check. I can see only 3 solutions:
> - read the status code in the Metadata of the Content object
> - don't store content for a fetch with a CrawlDatum status <> STATUS_FETCH_SUCCESS
> - load the CrawlDatum object in ParseSegment
> What are your thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

