nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
Date Mon, 29 Feb 2016 07:06:18 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-2144.
--------------------------------------
    Resolution: Fixed

OK, all fixed. Thanks, [~thammegowda]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 224, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (40/40), done.
Writing objects: 100% (51/51), 10.10 KiB | 0 bytes/s, done.
Total 51 (delta 25), reused 0 (delta 0)
To https://git-wip-us.apache.org/repos/asf/nutch.git
   f5e430e..15c583e  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann% 
{noformat}


> Plugin to override db.ignore.external to exempt interesting external domain URLs
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true) to fetch static resources from external domains.
> The generalized version of this: the plugin should permit interesting URLs from external domains (by overriding db.ignore.external). The interesting URLs are decided by a combination of regex and MIME-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs to fetch all images, which may be linked from CDNs and other domains. In this scenario, allowing all external links and then writing hundreds of regular expressions is not feasible for a large number of domains.
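
The rule idea described above (exempt an external link when it matches both a MIME-type rule and a URL regex) can be sketched roughly as follows. This is only an illustration and not the code from the attached patch: the class `ExternalLinkExemption`, its `rule` helper, and the `shouldFollow` method are hypothetical names, the MIME type is assumed to be guessed by the caller (for example from the file extension), and "external" is assumed to mean "different host name".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of a rule-based exemption check; not the Nutch plugin API. */
final class ExternalLinkExemption {

    /** One rule: a regex for the (guessed) MIME type paired with a regex for the URL. */
    static final class Rule {
        final Pattern mimeType;
        final Pattern url;

        Rule(Pattern mimeType, Pattern url) {
            this.mimeType = mimeType;
            this.url = url;
        }
    }

    private final List<Rule> rules;

    ExternalLinkExemption(List<Rule> rules) {
        this.rules = new ArrayList<>(rules);
    }

    /** Convenience factory; the URL regex is compiled case-insensitively. */
    static Rule rule(String mimeRegex, String urlRegex) {
        return new Rule(Pattern.compile(mimeRegex),
                        Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE));
    }

    /** Assumption: a link is external when the two host names differ. */
    static boolean isExternal(String fromUrl, String toUrl) {
        String from = URI.create(fromUrl).getHost();
        String to = URI.create(toUrl).getHost();
        return from == null || to == null || !from.equalsIgnoreCase(to);
    }

    /**
     * Mimics a focused crawl (external links ignored) with exemptions:
     * internal links are always followed, external links only when the
     * whole MIME type matches a rule and the URL regex is found in the URL.
     */
    boolean shouldFollow(String fromUrl, String toUrl, String guessedMimeType) {
        if (!isExternal(fromUrl, toUrl)) {
            return true;
        }
        for (Rule r : rules) {
            if (r.mimeType.matcher(guessedMimeType).matches()
                    && r.url.matcher(toUrl).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With a single rule such as `rule("image/.*", "\\.(jpg|png|gif)$")`, an image hosted on a CDN would be followed while an ordinary external HTML page would still be ignored, which is the image-crawl use case above without a per-domain list of regular expressions.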



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
