nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features
Date Sat, 05 Apr 2014 20:21:16 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-410:
--------------------------------

    Component/s:     (was: fetcher)

> Faster RegexNormalize with more features
> ----------------------------------------
>
>                 Key: NUTCH-410
>                 URL: https://issues.apache.org/jira/browse/NUTCH-410
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>         Environment: Tested on MacOS X 10.4.7/10.4.8
>            Reporter: Doug Cook
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: betterRegexNorm.patch
>
>
> The patch associated with this is backwards-compatible and has several improvements over
the stock 0.8 RegexURLNormalizer:
> 1) About a 34% performance improvement, from only executing the superclass (BasicURLNormalizer)
once in most cases, instead of twice as the stock version did. 
> 2) Support for expensive host-specific normalizations with good performance. Each <regex>
block optionally takes a list of hosts to which to apply the associated regex. If supplied,
the regex will only be applied to these hosts. This should have scalable performance; the
comparison is O(1) regardless of the number of hosts. The format is:
>     <regex>
>         <host>www.host1.com</host>
>         <host>host2.site2.com</host>
>         <pattern> my pattern here </pattern>
>         <substitution> my substitution here </substitution>
>     </regex>
> 3)  Support for decoding URLs with escaped character encodings (e.g. %20, etc.). This
is useful, for example, to decode "jump redirects" which have the target URL encoded within
the source, as on Yahoo. I tried to create an extensible notion of "options," the first of
which is "unescape." The unescape function is applied *after* the substitution and *only*
if the substitution pattern matches. A simple pattern to unescape Yahoo directory redirects
would be something like:
> <regex>
>   <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
>   <substitution>$1</substitution>
>   <options>unescape</options>
> </regex>
> 4) Added the notion of iterating the pattern chain. This is useful when the result of
a normalization can itself be normalized. While some of this can be handled in the stock version
by repeating patterns, or by careful ordering of patterns, the notion of iterating is cleaner
and more powerful. The chain is defined to iterate only when the previous iteration changes
the input, up to a configurable maximum number of iterations. The config parameter to change
is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous behavior). The change
is performance-neutral when disabled, and has a relatively small performance cost when enabled.
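
The iteration rule in item 4 can be sketched roughly as follows. This is an illustrative reading of the description, not the code in betterRegexNorm.patch: the class IteratingNormalizer, its nested Rule type, and the sample pattern are all hypothetical names chosen for the example. The chain is re-run only while the previous pass changed the URL, and never more than maxIterations times (the value that urlnormalizer.regex.maxiterations would supply).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of an iterated pattern chain; not the patch's actual code.
class IteratingNormalizer {

    /** One pattern/substitution pair, mirroring a <regex> block. */
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    private final List<Rule> rules;
    private final int maxIterations; // assumed to come from urlnormalizer.regex.maxiterations

    IteratingNormalizer(List<Rule> rules, int maxIterations) {
        this.rules = rules;
        this.maxIterations = maxIterations;
    }

    String normalize(String url) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            // Run the whole chain once, in order.
            String next = current;
            for (Rule rule : rules) {
                next = rule.pattern.matcher(next).replaceAll(rule.substitution);
            }
            // Stop as soon as a full pass leaves the URL unchanged.
            if (next.equals(current)) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // A single pass cannot collapse adjacent "/./" segments, because the
        // second segment shares its leading slash with the first match.
        List<Rule> rules = Arrays.asList(new Rule("/\\./", "/"));
        String url = "http://x/a/././b";
        System.out.println(new IteratingNormalizer(rules, 1).normalize(url));
        System.out.println(new IteratingNormalizer(rules, 5).normalize(url));
    }
}
```

With maxIterations set to 1 the loop body runs exactly once, which corresponds to the previous single-pass behavior; higher values cost one extra pass only for URLs that actually changed.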
> Pardon any potentially unconventional Java on my part. I've got lots of C/C++ search
engine experience, but Nutch is my first large Java app. I welcome any feedback, and hope
this is useful.
> Doug
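
For readers who want a concrete picture of items 2 and 3 in the description above, here is a minimal sketch under stated assumptions. It is not taken from the patch: the class HostScopedRule and its methods are hypothetical, the host names in main are illustrative, and java.net.URLDecoder is used as a stand-in for whatever unescaping the patch performs (URLDecoder does form decoding, so it also turns '+' into a space). A HashSet gives the constant-time host check the description mentions, and unescaping happens after the substitution and only when the pattern matched.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of one <regex> block with optional <host> and <options>.
class HostScopedRule {
    private final Set<String> hosts;   // empty set: rule applies to every host
    private final Pattern pattern;
    private final String substitution;
    private final boolean unescape;    // models <options>unescape</options>

    HostScopedRule(Set<String> hosts, String regex, String substitution, boolean unescape) {
        this.hosts = new HashSet<String>(hosts);
        this.pattern = Pattern.compile(regex);
        this.substitution = substitution;
        this.unescape = unescape;
    }

    String apply(String url) {
        // Hash lookup, so the cost does not grow with the number of hosts.
        if (!hosts.isEmpty()) {
            String host = hostOf(url);
            if (host == null || !hosts.contains(host)) {
                return url;
            }
        }
        Matcher m = pattern.matcher(url);
        if (!m.find()) {
            return url; // no match: no substitution and no unescaping
        }
        String result = m.replaceAll(substitution);
        return unescape ? decode(result) : result;
    }

    private static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparseable URL: treat as "host not listed"
        }
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Pattern from the description, with &amp; written as a plain '&'.
        HostScopedRule yahoo = new HostScopedRule(
            Collections.<String>emptySet(),
            "^http://[a-z\\.]*\\.yahoo\\.com/.*/\\*+(http[^&]+)", "$1", true);
        System.out.println(
            yahoo.apply("http://rds.yahoo.com/S=1/**http%3A%2F%2Fexample.com%2Fpage"));

        // Host-restricted rule; the sid parameter is made up for the example.
        HostScopedRule scoped = new HostScopedRule(
            new HashSet<String>(Arrays.asList("www.host1.com")),
            "\\?sid=[0-9]+$", "", false);
        System.out.println(scoped.apply("http://www.host1.com/a?sid=42"));
        System.out.println(scoped.apply("http://www.host3.com/a?sid=42"));
    }
}
```

Note that URLDecoder.decode throws IllegalArgumentException on malformed percent sequences, so a production normalizer would likely need to guard that call.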



--
This message was sent by Atlassian JIRA
(v6.2#6252)
