nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
Date Wed, 10 Feb 2016 17:19:18 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141213#comment-15141213 ]

Lewis John McGibbney commented on NUTCH-2144:
---------------------------------------------

Hi [~thammegowda], the limitations I see are as follows:
 * As mentioned, the HEAD request is going to slow things down. I see your FIXME. I have a
suggestion for the time being: let's initially address the case where we don't bother with
HEAD and just rely upon mimeType detection through evaluation of the URL suffix. What do you
think about this?
 * I feel that the invocation of this entire plugin could be extended to also deal with db.ignore.internal.
The exact same reasoning may apply: when we wish to crawl images from a set of domains, the
crawler needs to fetch all images which may be linked internally, but I have a list of, say,
5000 of these domains. In this scenario, allowing all internal links and then writing
hundreds of regular expressions is not feasible for a large number of domains.
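
To make the first suggestion concrete, here is a minimal sketch of suffix-based mimeType
detection with no HEAD request. It is not taken from the attached patch: the class name
SuffixMimeGuesser, the methods guess and isExempt, and the small suffix table are all
hypothetical and only illustrate the idea. A real plugin would presumably load its rules
from configuration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch or of the attached patch.
final class SuffixMimeGuesser {

    // Assumed, deliberately tiny suffix table; a real rule set would be configurable.
    private static final Map<String, String> SUFFIX_TO_MIME = Map.of(
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "png", "image/png",
            "gif", "image/gif",
            "css", "text/css",
            "js", "application/javascript");

    // Returns the guessed MIME type, or null when the URL has no known suffix.
    static String guess(String url) {
        String path;
        try {
            // getPath() drops the query string and fragment for us.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: no guess
        }
        if (path == null) {
            return null; // opaque URI such as mailto:
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return null; // no suffix at all
        }
        String suffix = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUFFIX_TO_MIME.get(suffix);
    }

    // Example rule: exempt anything that looks like an image.
    static boolean isExempt(String url) {
        String mime = guess(url);
        return mime != null && mime.startsWith("image/");
    }
}
```

The obvious trade-off is that URLs without a telling suffix (for example, images served
from a path like /render?id=42) are not recognized, which is the case the HEAD request
was meant to cover.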

This is a nice patch and a lot of work. I like the extension point.

> Plugin to override db.ignore.external to exempt interesting external domain URLs
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true)
> to fetch static resources from external domains.
> The generalized version of this: the plugin should permit interesting URLs from external
> domains (by overriding db.ignore.external). The interesting URLs are decided from a combination
> of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs to fetch
> all images which may be linked from CDNs and other domains. In this scenario, allowing all
> external links and then writing hundreds of regular expressions is not feasible for a large
> number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
