nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2690) Configurable and fast URL filter
Date Mon, 06 May 2019 15:32:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833919#comment-16833919 ]

ASF GitHub Bot commented on NUTCH-2690:
---------------------------------------

sebastian-nagel commented on pull request #433: NUTCH-2690 Configurable and fast URL filter
URL: https://github.com/apache/nutch/pull/433
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Configurable and fast URL filter
> --------------------------------
>
>                 Key: NUTCH-2690
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2690
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> This improvement introduces a new URL filter plugin "urlfilter-fast" (naming debatable), which has been in use at Common Crawl [since 2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd] to apply a long list of filters. Filtering takes two steps:
> # An exact (suffix) match against the host name retrieves the host- or domain-specific regex rules.
> # The selected regular expressions are applied to the path (and query) component of the URL.
> What makes it faster than urlfilter-regex for common cases:
> - Regexes are selected by host name or its domain suffix, so there are usually fewer rules to check. That's similar to NUTCH-1838, but any domain suffix can be matched, including {{subdomain.domain.com}}, {{com}}, or {{.}} for global rules. The selection by host name suffix is fast.
> - Regexes are applied only to the path component (optionally including the query), not to the entire URL. Matching against a shorter string can make a huge difference for more complex regular expressions.
> - The rule to deny everything from a host or domain gets special treatment to be fast.
> More details about the rule format are found in the plugin's [README|https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md].
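
The two-step lookup described above can be sketched roughly as follows. This is an illustration only, not the plugin's code: the plugin is written in Java, and the rule table `RULES`, the `DENY_ALL` marker, the helper names `suffixes` and `accept`, and the "first matching rule wins, otherwise accept" policy are all assumptions made for this sketch. The actual rule file format is documented in the README linked above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical in-memory rule table (NOT the plugin's rule file format).
# Key: a host name or domain suffix; "." holds the global rules.
# Value: DENY_ALL, or an ordered list of (allow, compiled_regex) pairs.
DENY_ALL = object()

RULES = {
    "spam.example.com": DENY_ALL,
    "example.com": [
        (False, re.compile(r"^/private/")),
        (False, re.compile(r"\?.*sessionid=")),
    ],
    ".": [(False, re.compile(r"\.(?:gif|jpe?g|png)$", re.IGNORECASE))],
}


def suffixes(host):
    """Yield the host, then each shorter domain suffix, then "." (global)."""
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])
    yield "."


def accept(url, rules=RULES):
    """Return True if the URL passes the filter, False if it is rejected."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Regexes only ever see the path (plus query), never the full URL.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Step 1: exact dictionary lookups by host name and its suffixes.
    for key in suffixes(host):
        selected = rules.get(key)
        if selected is None:
            continue
        # Deny-all shortcut: no regex is evaluated for this host or domain.
        if selected is DENY_ALL:
            return False
        # Step 2: apply only the selected regexes to the path and query.
        for allow, pattern in selected:
            if pattern.search(target):
                return allow
    return True  # assumed default for this sketch: accept if nothing matched
```

For a URL such as `https://www.example.com/private/a`, the lookups try `www.example.com`, `example.com`, `com` and `.` in that order, so only the two `example.com` regexes and the single global regex can ever be evaluated, and each of them runs against `/private/a` rather than the whole URL.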



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
