nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Updated: (NUTCH-505) Outlink urls should be validated
Date Tue, 10 Jul 2007 19:12:05 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505.patch

New version of the patch. As Andrzej has pointed out, db.max.outlinks.per.page is read once
per getRecordWriter now.

> * you should increase the version number of ParseData, and add a code to read the current
> version of ParseData. Otherwise the updated code won't be able to read older segments.


This patch doesn't change how parse data reads outlinks. Before this patch, parse data used to read
db.max.outlinks.per.page outlinks, then skip over the rest (as in, read each outlink and then
ignore it). After this patch, parse data reads all outlinks. So the I/O behaviour is the same:
in both cases every stored outlink is read.
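
To illustrate the point, here is a minimal sketch of the two read strategies. It is not the
actual ParseData code: the class, the method names and the serialized layout (a count followed
by url strings) are assumptions made up for the example. It only shows that "read and ignore"
and "read and keep" consume the same bytes from the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkReadSketch {

  // Hypothetical layout: an int count followed by that many url strings.
  static byte[] write(List<String> urls) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(urls.size());
    for (String url : urls) {
      out.writeUTF(url);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Old behaviour: keep the first maxOutlinks, read and discard the rest.
  static List<String> readCapped(DataInputStream in, int maxOutlinks) throws IOException {
    int total = in.readInt();
    List<String> kept = new ArrayList<String>();
    for (int i = 0; i < total; i++) {
      String url = in.readUTF(); // always read, so the stream stays aligned
      if (i < maxOutlinks) {
        kept.add(url);
      }
    }
    return kept;
  }

  // New behaviour: keep every outlink that was stored.
  static List<String> readAll(DataInputStream in) throws IOException {
    int total = in.readInt();
    List<String> kept = new ArrayList<String>(total);
    for (int i = 0; i < total; i++) {
      kept.add(in.readUTF());
    }
    return kept;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = write(Arrays.asList(
        "http://a.example/", "http://b.example/", "http://c.example/"));
    DataInputStream capped = new DataInputStream(new ByteArrayInputStream(data));
    DataInputStream all = new DataInputStream(new ByteArrayInputStream(data));
    // Both readers end with no unread bytes; only the kept count differs.
    System.out.println(readCapped(capped, 2).size() + " kept, " + capped.available() + " bytes left");
    System.out.println(readAll(all).size() + " kept, " + all.available() + " bytes left");
  }
}
```

Since both variants read the same bytes in the same order, the sketch has no format change that
a version check would need to detect.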

> Outlink urls should be validated
> --------------------------------
>
>                 Key: NUTCH-505
>                 URL: https://issues.apache.org/jira/browse/NUTCH-505
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Priority: Minor
>         Attachments: NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation system that
> tests these urls and filters out garbage.
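
As a rough illustration of the kind of check the issue asks for, the sketch below filters a list
of extracted outlinks with java.net.URI. It is not the attached patch: the class and method
names are hypothetical, and the rules (syntactically valid, http or https scheme, a parseable
host) are assumptions chosen for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkUrlCheckSketch {

  // Assumed rule set: the url must parse as a URI, use http or https,
  // and have a host component. Real validation may need to be stricter.
  static boolean looksValid(String url) {
    if (url == null) {
      return false;
    }
    try {
      URI uri = new URI(url.trim());
      String scheme = uri.getScheme();
      boolean httpLike = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
      return httpLike && uri.getHost() != null;
    } catch (URISyntaxException e) {
      // Characters such as '<' and '>' are illegal in a URI and end up here.
      return false;
    }
  }

  // Keep only the outlinks that pass the check, preserving their order.
  static List<String> filter(List<String> urls) {
    List<String> kept = new ArrayList<String>();
    for (String url : urls) {
      if (looksValid(url)) {
        kept.add(url);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> extracted = Arrays.asList(
        "http://www.example.com/index.html",
        "http://www.example.com/</div></a>", // markup leaked into the url
        "mailto:someone@example.com");
    System.out.println(filter(extracted));
  }
}
```

A check like this is deliberately conservative; for instance, java.net.URI returns no host for
names containing an underscore, so such urls would be dropped as well.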

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

