nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1101) Options to purge db_gone records in updatedb
Date Tue, 06 Sep 2011 12:56:09 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1101:
---------------------------------

    Attachment: NUTCH-1101-1.4-2.patch

Thanks Julien! I've modified the patch to rely on the config option, which can also be set
on the CLI via Tool. Tested and confirmed to work.

Also added the setting to nutch-default.xml.
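
For readers unfamiliar with the mechanism: a Hadoop Tool accepts generic options such as
-D name=value, which override values from the configuration files. The sketch below only
simulates that override behaviour in plain Java so it runs without Hadoop on the classpath.
The class, the method names and the property name example.purge.gone are hypothetical; the
real property name is not stated in this message and the patch itself is not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CliOverrideSketch {
    // Hypothetical property name, used for illustration only.
    static final String PURGE_KEY = "example.purge.gone";

    // Stands in for the defaults file: the option is off unless overridden.
    static Map<String, String> defaults() {
        Map<String, String> conf = new HashMap<>();
        conf.put(PURGE_KEY, "false");
        return conf;
    }

    // Applies every "-D name=value" pair to conf and returns the remaining arguments.
    static List<String> applyOverrides(Map<String, String> conf, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        Map<String, String> conf = defaults();
        List<String> rest = applyOverrides(conf, args);
        System.out.println(PURGE_KEY + " = " + conf.get(PURGE_KEY));
        System.out.println("remaining arguments: " + rest);
    }
}
```

Running it with the arguments -D example.purge.gone=true crawldb flips the option for that
run only, while the value in the defaults stays false.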

> Options to purge db_gone records in updatedb
> --------------------------------------------
>
>                 Key: NUTCH-1101
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1101
>             Project: Nutch
>          Issue Type: New Feature
>          Components: linkdb
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4
>
>         Attachments: NUTCH-1101-1.4-1.patch, NUTCH-1101-1.4-2.patch
>
>
> Add an option to updatedb to filter out records with status db_gone (HTTP 404). This is
> especially useful when a crawl db is targeted at a single specific site. If the site,
> for some reason, suddenly changes a lot of URLs, we'll get a crawl db filled with garbage.
> Since the targeted site is known (or controlled), it is safe to get rid of all these URLs,
> which reduces the db size and avoids useless HTTP requests.
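
The filtering described in the issue can be sketched roughly as follows. This is a minimal
illustration, not the attached patch: it models the crawl db as an in-memory map and uses a
hypothetical Status enum and purgeGone flag, whereas the real updatedb job runs as a
MapReduce job over CrawlDatum records.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PurgeGoneSketch {
    // Hypothetical status values standing in for the crawl db status codes.
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE }

    // Returns a copy of the crawl db. When purgeGone is true, records with
    // status DB_GONE are left out; otherwise every record is kept.
    static Map<String, Status> update(Map<String, Status> crawlDb, boolean purgeGone) {
        Map<String, Status> result = new LinkedHashMap<>();
        for (Map.Entry<String, Status> entry : crawlDb.entrySet()) {
            if (purgeGone && entry.getValue() == Status.DB_GONE) {
                continue; // drop the gone record instead of carrying it forward
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Status> db = new LinkedHashMap<>();
        db.put("http://example.com/", Status.DB_FETCHED);
        db.put("http://example.com/old-page", Status.DB_GONE);
        db.put("http://example.com/new-page", Status.DB_UNFETCHED);

        System.out.println("purged:   " + update(db, true).keySet());
        System.out.println("unpurged: " + update(db, false).keySet());
    }
}
```

With the flag off the db is carried over unchanged, so the purge stays opt-in, which matches
the issue's point that it is only safe when the targeted site is known or controlled.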

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
