nutch-dev mailing list archives

From "Thamme Gowda N (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
Date Mon, 19 Oct 2015 06:54:05 GMT
Thamme Gowda N created NUTCH-2144:
-------------------------------------

             Summary: Plugin to override db.ignore.external to exempt interesting external domain URLs
                 Key: NUTCH-2144
                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
             Project: Nutch
          Issue Type: New Feature
          Components: crawldb, fetcher
            Reporter: Thamme Gowda N
            Priority: Minor


Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true)
to fetch static resources from external domains.
More generally, the plugin should permit interesting URLs from external domains by overriding
db.ignore.external.links. Whether a URL is interesting is decided by a combination of regex
and MIME-type rules.
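
A minimal sketch of how such a combined rule could be evaluated, for illustration only. It is not
the Nutch plugin API: the class name ExemptionRules, the nested Rule type, and the isExempt method
are all hypothetical, and the sketch assumes the caller already knows (or has guessed) the MIME
type of the URL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex + MIME-type exemption rules; not part of Nutch. */
public class ExemptionRules {

    /** One rule: the URL must match urlPattern AND the MIME type must match mimePattern. */
    public static final class Rule {
        final Pattern urlPattern;
        final Pattern mimePattern;

        public Rule(String urlRegex, String mimeRegex) {
            this.urlPattern = Pattern.compile(urlRegex);
            this.mimePattern = Pattern.compile(mimeRegex);
        }

        boolean matches(String url, String mimeType) {
            // An unknown MIME type never matches, so the URL stays subject
            // to the normal external-link restriction.
            return mimeType != null
                && urlPattern.matcher(url).matches()
                && mimePattern.matcher(mimeType).matches();
        }
    }

    private final List<Rule> rules;

    public ExemptionRules(List<Rule> rules) {
        this.rules = rules;
    }

    /** Returns true if any rule exempts this external URL from being ignored. */
    public boolean isExempt(String url, String mimeType) {
        for (Rule rule : rules) {
            if (rule.matches(url, mimeType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed example rule: image files, identified by extension and MIME type.
        ExemptionRules rules = new ExemptionRules(Arrays.asList(
            new Rule(".*\\.(jpg|jpeg|png|gif)$", "image/.*")));

        System.out.println(rules.isExempt("http://cdn.example.net/a/photo.jpg", "image/jpeg")); // true
        System.out.println(rules.isExempt("http://other.example.org/page.html", "text/html"));  // false
    }
}
```

Requiring both patterns to match keeps the exemption narrow: a single rule such as the one above
covers images on any external host, while other external links remain ignored.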


Concrete use case:
  When using Nutch to crawl images from a set of domains, the crawler needs to fetch all images,
including those linked from CDNs and other domains. In this scenario, allowing all external links
and then writing hundreds of regular expressions to filter them is not feasible for a large number
of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
