nutch-dev mailing list archives

From Doğacan Güney (JIRA) <>
Subject [jira] Updated: (NUTCH-505) Outlink urls should be validated
Date Tue, 10 Jul 2007 12:42:05 GMT


Doğacan Güney updated NUTCH-505:

    Attachment: NUTCH-505.patch

New patch. This is more or less a release candidate; if there are no objections, I think
this patch can go in as it is.

The biggest change is that ParseData is no longer a Configurable. In the current implementation,
when parse data reaches ParseOutputFormat, it contains at most db.max.outlinks.per.page
outlinks; ParseOutputFormat then filters them and outputs whatever remains.

For example, if ignoreExternalLinks is true and the first hundred links (assuming
db.max.outlinks.per.page is 100) are all external, no outlinks will be extracted,
even if there are internal urls past the 100th outlink.

So now parse data reads all outlinks, and ParseOutputFormat processes them and outputs at
most db.max.outlinks.per.page outlinks (the resulting parse data contains those outlinks
too). I think this is a better approach, but it may be a bit slower.
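
The difference between the two orderings can be sketched in a few lines of Java. This is
only an illustration, not code from the patch: the class and method names, the example
urls, and the "internal link" predicate are all hypothetical, and the limit of 2 stands in
for db.max.outlinks.per.page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from Nutch.
public class OutlinkLimitSketch {

    // Old order: cut the list down to the limit first, then filter what is left.
    static List<String> truncateThenFilter(List<String> links, int max, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String link : links.subList(0, Math.min(max, links.size()))) {
            if (keep.test(link)) {
                out.add(link);
            }
        }
        return out;
    }

    // New order: look at every link, keep the ones that pass, stop at the limit.
    static List<String> filterThenTruncate(List<String> links, int max, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String link : links) {
            if (out.size() >= max) {
                break;
            }
            if (keep.test(link)) {
                out.add(link);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://other.example/a",
            "http://other.example/b",
            "http://site.example/c",
            "http://site.example/d");
        // Stand-in for the ignoreExternalLinks check.
        Predicate<String> internalOnly = u -> u.startsWith("http://site.example/");

        System.out.println(truncateThenFilter(links, 2, internalOnly)); // []
        System.out.println(filterThenTruncate(links, 2, internalOnly)); // [http://site.example/c, http://site.example/d]
    }
}
```

The second method has to inspect links past the limit, which is where the possible
slowdown comes from.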

Besides this change, UrlValidator code is cleaned up and moved into package.
Also, outlinks are not normalized in ParseOutputFormat since they are already normalized in
Outlink.Outlink. There is no point in normalizing them twice.
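
For readers unfamiliar with the idea, a url validity check might look roughly like the
sketch below. It is not the UrlValidator from the patch, whose rules are not described in
this message; the class name, the method name, and the choice to accept only http and https
urls with a host are assumptions made for illustration. It relies only on java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; this is not Nutch's UrlValidator.
public class UrlCheckSketch {

    // Returns true only for syntactically valid http/https urls that have a host.
    static boolean looksValid(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            if (scheme == null) {
                return false; // relative reference, no scheme
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return false; // e.g. javascript: or mailto: links
            }
            String host = uri.getHost();
            return host != null && !host.isEmpty();
        } catch (URISyntaxException e) {
            return false; // illegal characters or malformed syntax
        }
    }

    public static void main(String[] args) {
        System.out.println(looksValid("http://example.com/page"));
        System.out.println(looksValid("javascript:void(0)"));
        System.out.println(looksValid("http://exa mple.com/"));
    }
}
```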

> Outlink urls should be validated
> --------------------------------
>                 Key: NUTCH-505
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Priority: Minor
>         Attachments: NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
> See discussion here:
> Parse plugins may extract garbage urls from pages. We need a url validation system that
> tests these urls and filters out garbage.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
