nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
Date Mon, 11 Feb 2008 16:52:09 GMT


Andrzej Bialecki commented on NUTCH-596:

bq. I didn't find any useful information in the Content object to know whether the crawl has
been successful.

Well, the idea was to add the ProtocolStatus code to Content.metadata, and then retrieve it in
ParseSegment. It is a hack, but it has minimal impact, both on the code and on the amount
of processed data.
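
To make the idea concrete, here is a minimal sketch in plain Java. It is not Nutch code: a Map stands in for Content.metadata, and the key name STATUS_KEY and the numeric codes are hypothetical placeholders, not values taken from Nutch. Treating a missing or unreadable status as "do not parse" is also an assumption made only for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class StatusInMetadataSketch {
    // Hypothetical metadata key; not a name taken from Nutch.
    static final String STATUS_KEY = "sketch.protocol.status";
    // Placeholder codes standing in for ProtocolStatus values.
    static final int SUCCESS = 1;
    static final int FAILED = 2;

    // Fetch side: record the status code next to the fetched content.
    static void recordStatus(Map<String, String> metadata, int code) {
        metadata.put(STATUS_KEY, Integer.toString(code));
    }

    // Parse side: parse only when the recorded status is SUCCESS.
    // Assumption: a missing or malformed status means "skip".
    static boolean shouldParse(Map<String, String> metadata) {
        String raw = metadata.get(STATUS_KEY);
        if (raw == null) {
            return false;
        }
        try {
            return Integer.parseInt(raw) == SUCCESS;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> fetchedOk = new HashMap<>();
        recordStatus(fetchedOk, SUCCESS);

        Map<String, String> fetchFailed = new HashMap<>();
        recordStatus(fetchFailed, FAILED);

        System.out.println("parse fetchedOk?   " + shouldParse(fetchedOk));
        System.out.println("parse fetchFailed? " + shouldParse(fetchFailed));
    }
}
```

The only extra data carried per record is one short metadata entry, which is why this approach adds so little to the amount of processed data.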

bq. I thought of another way to do that: we can create a simple Map/Reduce task to load CrawlDatum
+ Content and filter the content that has a DbStatus == Success. The output of this task will
then be used by the existing ParseSegment task. This solution avoids parsing any content that
could cause errors in the parsing.

This costs even more than option #3 (add crawl_fetch as one of the inputs), because you
still need to process the same data, but now you need to run 2 separate jobs. If we go down this
road, it's better just to stick with option #3.
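
As a rough illustration of what option #3 amounts to, the sketch below simulates in plain Java a single pass that sees both the fetch status and the content for each URL and keeps only the successfully fetched entries. It does not use the Hadoop or Nutch APIs; the class name, the method urlsToParse, and the FETCH_SUCCESS value are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinFetchStatusSketch {
    // Placeholder value, not the real CrawlDatum.STATUS_FETCH_SUCCESS constant.
    static final int FETCH_SUCCESS = 1;

    // One pass over the content, looking up the fetch status for the same URL.
    // Content with no status, or a non-success status, is skipped.
    static List<String> urlsToParse(Map<String, Integer> fetchStatusByUrl,
                                    Map<String, String> contentByUrl) {
        List<String> parseable = new ArrayList<>();
        for (Map.Entry<String, String> entry : contentByUrl.entrySet()) {
            Integer status = fetchStatusByUrl.get(entry.getKey());
            if (status != null && status == FETCH_SUCCESS) {
                parseable.add(entry.getKey());
            }
        }
        return parseable;
    }

    public static void main(String[] args) {
        Map<String, Integer> status = new LinkedHashMap<>();
        status.put("http://a.example/", FETCH_SUCCESS);
        status.put("http://b.example/", 2); // any non-success placeholder code

        Map<String, String> content = new LinkedHashMap<>();
        content.put("http://a.example/", "<html>ok</html>");
        content.put("http://b.example/", "<html>error page</html>");

        System.out.println(urlsToParse(status, content));
    }
}
```

The filtering and the decision to parse happen in the same pass here, which is the point of the comparison: a separate filtering job would read the same two inputs and then hand its output to a second job.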

> ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
> ---------------------------------------------------------------------------
>                 Key: NUTCH-596
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Emmanuel Joke
> We have 2 choices to parse the content: either within the Fetcher class or with ParseSegment.
> Fetcher (1 or 2) will first check whether the CrawlDatum status == STATUS_FETCH_SUCCESS, and if it is
> true it will parse the content.
> However, we don't have this check in ParseSegment, thus we parse all content stored on
> the disk without checking the status.
> So I think we should implement this check. I can see only 3 solutions:
> - read the status code in the Metadata of the Content object
> - don't store content for a fetch with a CrawlDatum status <> STATUS_FETCH_SUCCESS
> - load the CrawlDatum object in ParseSegment
> What are your thoughts?

