nutch-dev mailing list archives

From "Emmanuel Joke (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
Date Sun, 30 Dec 2007 09:52:43 GMT
ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
---------------------------------------------------------------------------

                 Key: NUTCH-596
                 URL: https://issues.apache.org/jira/browse/NUTCH-596
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Emmanuel Joke


We have two choices for parsing content: within the Fetcher class or with the ParseSegment
class.
Fetcher (1 or 2) first checks whether the CrawlDatum status is STATUS_FETCH_SUCCESS and
parses the content only if it is.

However, ParseSegment has no such check, so it parses all content stored on disk without
checking the status.

So I think we should implement this check. I can see only three solutions:
- read the status code from the Metadata of the Content object
- don't store content for fetches whose CrawlDatum status is not STATUS_FETCH_SUCCESS
- load the CrawlDatum object in ParseSegment
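
As a rough illustration of the first option, here is a minimal, self-contained sketch. It
does not use the real Nutch classes: the metadata is modelled as a plain Map, and the key
name "fetch.status" and the value 1 are hypothetical placeholders, not the actual Nutch
constants. Skipping content that has no recorded status is also just an assumed policy.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseStatusFilter {
    // Hypothetical stand-ins for illustration only; these are NOT the
    // actual Nutch metadata key or the CrawlDatum status value.
    static final String FETCH_STATUS_KEY = "fetch.status";
    static final int STATUS_FETCH_SUCCESS = 1;

    /**
     * Returns true only when the content metadata records a successful fetch.
     * Content with a missing or malformed status is skipped (assumed policy).
     */
    static boolean shouldParse(Map<String, String> contentMetadata) {
        String raw = contentMetadata.get(FETCH_STATUS_KEY);
        if (raw == null) {
            return false; // no status recorded: skip parsing
        }
        try {
            return Integer.parseInt(raw.trim()) == STATUS_FETCH_SUCCESS;
        } catch (NumberFormatException e) {
            return false; // unreadable status: skip parsing
        }
    }

    public static void main(String[] args) {
        Map<String, String> fetched = new HashMap<>();
        fetched.put(FETCH_STATUS_KEY, String.valueOf(STATUS_FETCH_SUCCESS));

        Map<String, String> failed = new HashMap<>();
        failed.put(FETCH_STATUS_KEY, "2"); // any status other than success

        Map<String, String> unknown = new HashMap<>(); // no status stored

        System.out.println("fetched -> parse? " + shouldParse(fetched));
        System.out.println("failed  -> parse? " + shouldParse(failed));
        System.out.println("unknown -> parse? " + shouldParse(unknown));
    }
}
```

In ParseSegment, a check like this would run before the parser is invoked, so that entries
without a successful fetch are skipped instead of parsed.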

What are your thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

