nutch-dev mailing list archives

From Doğacan Güney (JIRA) <>
Subject [jira] Updated: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
Date Thu, 17 Apr 2008 15:03:21 GMT


Doğacan Güney updated NUTCH-596:

    Attachment: NUTCH-596_v1.patch

A simple patch for option 1: it puts the fetch status in the content metadata and retrieves it
during parse, skipping records whose status is not FETCH_SUCCESS.
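
For readers without the attachment at hand, the idea can be sketched roughly as below. This is a
stand-alone illustration, not the contents of NUTCH-596_v1.patch: the class name, the metadata key
"fetch.status", and the numeric status values are all hypothetical, and a plain Map stands in for
the content metadata. Skipping records that carry no status at all is also an assumption of this
sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-alone sketch: none of these names are taken from NUTCH-596_v1.patch.
final class FetchStatusFilter {
    // Hypothetical metadata key and placeholder status codes (assumptions).
    static final String FETCH_STATUS_KEY = "fetch.status";
    static final int STATUS_FETCH_SUCCESS = 1;
    static final int STATUS_FETCH_GONE = 2;

    // Fetch side: record the fetch status next to the stored content.
    static void markStatus(Map<String, String> contentMetadata, int status) {
        contentMetadata.put(FETCH_STATUS_KEY, Integer.toString(status));
    }

    // Parse side: parse only records explicitly marked as fetched successfully.
    static boolean shouldParse(Map<String, String> contentMetadata) {
        String value = contentMetadata.get(FETCH_STATUS_KEY);
        if (value == null) {
            return false; // no status recorded: skipped in this sketch
        }
        try {
            return Integer.parseInt(value) == STATUS_FETCH_SUCCESS;
        } catch (NumberFormatException e) {
            return false; // unreadable status: treated as not successful
        }
    }

    // Returns the URLs of the records in a segment that would be parsed.
    static List<String> selectForParsing(Map<String, Map<String, String>> segment) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> record : segment.entrySet()) {
            if (shouldParse(record.getValue())) {
                urls.add(record.getKey());
            }
        }
        return urls;
    }
}
```

With this shape, the parse step needs only the content record itself and does not have to load the
matching CrawlDatum, which is the main attraction of option 1.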

> ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
> ---------------------------------------------------------------------------
>                 Key: NUTCH-596
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Emmanuel Joke
>         Attachments: NUTCH-596_v1.patch
> We have 2 choices for parsing the content: either within the Fetcher class or with ParseSegment.
> Fetcher (1 or 2) first checks whether the CrawlDatum status is STATUS_FETCH_SUCCESS, and only
> if it is does it parse the content.
> However, we don't have this check in ParseSegment, so we parse all content stored on
> disk without checking the status.
> So I think we should implement this check. I can see only 3 solutions:
> - read the status code from the Metadata of the Content object
> - don't store content for a fetch whose CrawlDatum status is not STATUS_FETCH_SUCCESS
> - load the CrawlDatum object in ParseSegment
> What are your thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
