nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb
Date Thu, 06 Apr 2017 11:23:41 GMT


ASF GitHub Bot commented on NUTCH-2335:

sebastian-nagel closed pull request #158: NUTCH-2335 Injector not to filter and normalize
existing items/URLs in CrawlDb
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>                 Key: NUTCH-2335
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>             Fix For: 1.14
> With NUTCH-1712 the behavior of the Injector has changed in case new URLs are added to
> an existing CrawlDb:
> - before, only injected URLs were filtered and normalized
> - now, filters and normalizers are applied to all URLs, including those already in the
> CrawlDb
> The default should be, as before, not to filter existing URLs. Filtering and normalizing
> may take a long time for large CrawlDbs and/or complex URL filters. If URL filter or normalizer
> rules have not changed, there is no need to apply them anew every time new URLs are added.
> Of course, injected URLs should still be filtered and normalized by default.
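
The difference between the two behaviors described in the issue can be illustrated with a small, self-contained sketch. This is not the Nutch Injector code: the class InjectSketch, the methods process and inject, the processExisting flag, and the map-based stand-in for a CrawlDb are all hypothetical names chosen for illustration. The normalizer and filter are plain Java functions rather than Nutch plugins.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not the actual Nutch Injector.
public class InjectSketch {

    /** Normalizes, then filters one URL. Returns null if the URL is rejected. */
    static String process(String url, UnaryOperator<String> normalizer,
                          Predicate<String> filter) {
        String normalized = normalizer.apply(url);
        return (normalized != null && filter.test(normalized)) ? normalized : null;
    }

    /**
     * Merges seed URLs into an existing URL-to-status map (a stand-in for a CrawlDb).
     * Seeds are always normalized and filtered. Existing entries are only
     * normalized and filtered when processExisting is true (an assumed flag).
     */
    static Map<String, String> inject(Map<String, String> existing, List<String> seeds,
                                      UnaryOperator<String> normalizer,
                                      Predicate<String> filter,
                                      boolean processExisting) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : existing.entrySet()) {
            String url = processExisting
                    ? process(e.getKey(), normalizer, filter)
                    : e.getKey(); // keep existing entries untouched
            if (url != null) {
                merged.put(url, e.getValue());
            }
        }
        for (String seed : seeds) {
            String url = process(seed, normalizer, filter);
            if (url != null) {
                merged.putIfAbsent(url, "injected"); // do not overwrite known URLs
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new LinkedHashMap<>();
        existing.put("http://example.org/", "fetched");
        existing.put("http://example.org/old.pdf", "fetched");
        List<String> seeds = List.of("HTTP://EXAMPLE.ORG/new", "http://example.org/new.pdf");

        UnaryOperator<String> lowercase = u -> u.toLowerCase(Locale.ROOT);
        Predicate<String> noPdf = u -> !u.endsWith(".pdf");

        // Only injected URLs are filtered: old.pdf stays in the map.
        System.out.println(inject(existing, seeds, lowercase, noPdf, false).keySet());
        // All URLs are filtered: old.pdf is dropped as well.
        System.out.println(inject(existing, seeds, lowercase, noPdf, true).keySet());
    }
}
```

With processExisting set to false, the loop over existing entries does no per-URL filter or normalizer work, which is the cost the issue wants to avoid for large CrawlDbs. Setting it to true corresponds to re-applying changed rules to everything already stored.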

This message was sent by Atlassian JIRA
