nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
Date Wed, 28 Feb 2007 16:34:58 GMT


Andrzej Bialecki  commented on NUTCH-443:

Re: the "fake" CrawlDatum-s: this looks ugly no matter which way we look at it ... :| It appears
you were right from the start, FETCH_TIME_KEY seems to be the lesser evil at the moment.

Re: ParseResult.filter(): indeed - in fact, there is an inconsistency between what Fetcher
does and what ParseSegment does. Fetcher actually stores the information about failed parsing
- I had the impression that ParseSegment did this too. IMHO it's a good opportunity to fix
this so that it works the same way in both places. Currently this information is used only
in SegmentReader to provide the info about the total numbers of generated, fetched and parsed
urls. However, other tools may use it to determine the failure rate of a specific parser ...
so I would hate to discard it.

Re: ParseImpl.isFetched compat issue - I was wrong here. That's a relief - I hate such complications.


> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>                 Key: NUTCH-443
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch,
> NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch,
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch,
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can
> return multiple parse objects, that will all be indexed separately. Advantage: no need to
> fetch all feed-items separately.
> see the discussion at
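
For readers unfamiliar with the proposal, the idea in the issue description (one parse call on an already-fetched feed yielding one parse per feed item, keyed by the item's URL) might look roughly like the sketch below. This is only an illustration: SimpleParse, FeedItem and parseFeed are hypothetical stand-ins invented for this example, not Nutch's actual Parse or Parser types and not the code in the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiParseSketch {

    /** Hypothetical stand-in for a parse result (NOT Nutch's Parse interface). */
    static final class SimpleParse {
        final String title;
        final String text;

        SimpleParse(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** Hypothetical feed item: the URL it links to, plus its title and body. */
    static final class FeedItem {
        final String url;
        final String title;
        final String description;

        FeedItem(String url, String title, String description) {
            this.url = url;
            this.title = title;
            this.description = description;
        }
    }

    /**
     * Turns the items of a single fetched feed into one parse per item,
     * keyed by item URL, so that each entry could be indexed separately
     * without fetching the item pages one by one.
     */
    static Map<String, SimpleParse> parseFeed(List<FeedItem> items) {
        // LinkedHashMap keeps the items in feed order.
        Map<String, SimpleParse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            parses.put(item.url, new SimpleParse(item.title, item.description));
        }
        return parses;
    }
}
```

A caller would then iterate over the map entries and hand each URL/parse pair to the indexing step, instead of scheduling a separate fetch for every feed item.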

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
