nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Updated: (NUTCH-505) Outlink urls should be validated
Date Tue, 10 Jul 2007 12:42:05 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505.patch

New patch. This is something of a release candidate: if there are no objections, I think this patch
can go in as it is.

The biggest change is that ParseData is no longer a Configurable. In the current implementation,
when a parse data reaches ParseOutputFormat, it contains at most db.max.outlinks.per.page outlinks;
ParseOutputFormat then filters them and outputs whatever remains.

For example, if ignoreExternalLinks is true and the first hundred links (assuming
db.max.outlinks.per.page is 100) are all external, no outlinks will be extracted,
even if there are internal urls past the 100th outlink.

So now parse data reads all outlinks, and ParseOutputFormat processes them and outputs at most
db.max.outlinks.per.page outlinks (the resulting parse data also contains at most
db.max.outlinks.per.page outlinks). I think this is a better approach, but it may be a bit slower.
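
To illustrate the difference in ordering, here is a minimal standalone sketch. It is not the code
from the patch: the class OutlinkLimitSketch, both method names, the example hosts, and the use of a
plain Predicate in place of Nutch's filters are all assumptions made for illustration, and it assumes
a non-negative limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names come from Nutch.
public class OutlinkLimitSketch {

    // Old order: cut the list down to the limit first, then filter what is left.
    static List<String> truncateThenFilter(List<String> outlinks, int maxOutlinks,
                                           Predicate<String> keep) {
        List<String> result = new ArrayList<>();
        int end = Math.min(maxOutlinks, outlinks.size());
        for (String url : outlinks.subList(0, end)) {
            if (keep.test(url)) {
                result.add(url);
            }
        }
        return result;
    }

    // New order: look at every outlink, keep the ones that pass the filter,
    // and stop once the limit has been reached.
    static List<String> filterThenTruncate(List<String> outlinks, int maxOutlinks,
                                           Predicate<String> keep) {
        List<String> result = new ArrayList<>();
        for (String url : outlinks) {
            if (result.size() >= maxOutlinks) {
                break;
            }
            if (keep.test(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 100 external links followed by 5 internal ones (made-up hosts).
        List<String> outlinks = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            outlinks.add("http://external.example/" + i);
        }
        for (int i = 0; i < 5; i++) {
            outlinks.add("http://internal.example/" + i);
        }
        // Stand-in for ignoreExternalLinks = true.
        Predicate<String> internalOnly = u -> u.startsWith("http://internal.example/");

        System.out.println(truncateThenFilter(outlinks, 100, internalOnly).size()); // 0
        System.out.println(filterThenTruncate(outlinks, 100, internalOnly).size()); // 5
    }
}
```

With a limit of 100, the first method drops every internal link because they all sit past the cut,
while the second still finds them at the cost of scanning the whole list.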

Besides this change, the UrlValidator code has been cleaned up and moved into the org.apache.nutch.net package.
Also, outlinks are not normalized in ParseOutputFormat since they are already normalized in
Outlink.Outlink. There is no point in normalizing them twice.

> Outlink urls should be validated
> --------------------------------
>
>                 Key: NUTCH-505
>                 URL: https://issues.apache.org/jira/browse/NUTCH-505
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Priority: Minor
>         Attachments: NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation system that
> tests these urls and filters out garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

