nutch-dev mailing list archives

From "Thamme Gowda N (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
Date Wed, 10 Feb 2016 17:45:18 GMT


Thamme Gowda N commented on NUTCH-2144:

Hi [~lewismc]
* I think relying on URL-suffix-based MIME type detection is a reasonable trade-off: it gives up
some precision in exchange for speed. I did this for one of my homework assignments and, to be
honest, I disabled HEAD-based MIME type detection because it was taking a lot of time. This patch
uses basic Java regex to filter. [~chrismattmann] I am not sure whether Tika can take a URL and
guess the possible MIME type without making a HEAD call. Can you point me to an example, if there is one?
* Agreed. The same logic can be applied to permit certain URLs within the same domain.
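
To illustrate the suffix-based approach described above, here is a minimal sketch that guesses a MIME type from the URL alone, with no HEAD request. The class name `SuffixMimeGuesser`, the method `guess`, and the suffix table are hypothetical and are not taken from the attached patch; the table is deliberately small.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the patch): guesses a MIME type from the URL suffix only. */
public class SuffixMimeGuesser {

    // Illustrative suffix table; an assumption, not the rule set used by the patch.
    private static final Map<String, String> SUFFIX_TO_MIME = Map.of(
        "jpg", "image/jpeg",
        "jpeg", "image/jpeg",
        "png", "image/png",
        "gif", "image/gif",
        "css", "text/css",
        "js", "application/javascript");

    // Captures the extension at the end of the part before any '?' or '#'.
    private static final Pattern SUFFIX =
        Pattern.compile("^[^?#]*\\.([A-Za-z0-9]+)(?:[?#].*)?$");

    /** Returns the guessed MIME type, or empty if the suffix is missing or unknown. */
    public static Optional<String> guess(String url) {
        Matcher m = SUFFIX.matcher(url);
        if (!m.matches()) {
            return Optional.empty();
        }
        String suffix = m.group(1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(SUFFIX_TO_MIME.get(suffix));
    }
}
```

The trade-off is the one noted above: no network call is made, so the check is cheap, but a URL without a recognizable suffix (or with a misleading one) is classified wrongly or not at all.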

I am glad you liked it. Let me know what improvements are needed to make this useful for wide

> Plugin to override db.ignore.external to exempt interesting external domain URLs
> --------------------------------------------------------------------------------
>                 Key: NUTCH-2144
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
> Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true)
> to fetch static resources from external domains.
> The generalized version of this: the plugin should permit interesting URLs from external
> domains (by overriding db.ignore.external). The interesting URLs are decided by a combination
> of regex and MIME-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs to fetch
> all images, which may be linked from CDNs and other domains. In this scenario, allowing all
> external links and then writing hundreds of regular expressions is not feasible for a large
> number of domains.
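
The exemption idea in the description above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `ExternalLinkExemption` and its `keep` method are hypothetical names, not the plugin's actual API, and only the regex half of the rules is shown (MIME-type rules are omitted).

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of the exemption idea; names are illustrative, not the patch's API. */
public class ExternalLinkExemption {

    private final List<Pattern> exemptRules;

    public ExternalLinkExemption(List<String> regexRules) {
        this.exemptRules = regexRules.stream()
            .map(Pattern::compile)
            .collect(Collectors.toList());
    }

    /**
     * Decides whether a link should be kept when external links are otherwise ignored.
     * Same-host links are always kept; external links are kept only if a rule matches.
     * URI.create throws IllegalArgumentException on malformed input.
     */
    public boolean keep(String fromUrl, String toUrl) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        if (fromHost != null && fromHost.equalsIgnoreCase(toHost)) {
            return true; // internal link, not affected by the external-link setting
        }
        for (Pattern rule : exemptRules) {
            if (rule.matcher(toUrl).find()) {
                return true; // external link exempted by a rule
            }
        }
        return false; // external link with no matching exemption
    }
}
```

With a single rule such as `(?i)\.(jpe?g|png|gif)$`, images hosted on any CDN would be kept without listing each external domain, which is the point of the concrete use case.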

