nutch-dev mailing list archives

From "Doug Cook (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-365) Flexible URL normalization
Date Sat, 09 Sep 2006 16:00:23 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ] 
            
Doug Cook commented on NUTCH-365:
---------------------------------

PS. I like your idea of combining URL filters and normalization. In a sense, a "filter"
is just a normalizer that maps each URL either to itself or to nothing. It's a nice
abstraction, provided we can implement such "normalizers" as efficiently as the current filters.
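
To make the "filter is a degenerate normalizer" idea concrete, here is a rough sketch. The UrlMunger interface and the Mungers helper are names I made up for illustration; they are not the existing Nutch URLFilter or URLNormalizer APIs, and the null-means-rejected convention is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical unified interface (not the Nutch API): returns the
// rewritten URL, or null when the URL should be dropped.
interface UrlMunger {
    String apply(String url);
}

final class Mungers {
    private Mungers() {}

    // A "filter" expressed as a normalizer: each URL maps to itself or to nothing.
    static UrlMunger rejectMatching(String regex) {
        final Pattern pattern = Pattern.compile(regex);
        return url -> pattern.matcher(url).find() ? null : url;
    }

    // An ordinary normalizer: always returns a URL, here without its fragment.
    static UrlMunger stripFragment() {
        return url -> {
            int hash = url.indexOf('#');
            return hash < 0 ? url : url.substring(0, hash);
        };
    }
}
```

Both kinds of object share one method signature, so the caller does not need to know which one it is holding.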

If we iterated over these new "normalizers" and allowed flexible combinations of them
with short-circuit evaluation, as we do with filters, then the first pass could throw
away the obvious garbage (file types we don't handle, advertisements, etc.), and later
passes could normalize and then filter the normalized URLs.
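
A minimal sketch of such a chain, assuming each stage is a function from URL to URL that returns null to reject. MungerChain is a hypothetical name, and the stages in the usage note are illustrative patterns, not anything from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical chain: runs stages in order and stops at the first
// stage that returns null (short-circuit evaluation).
final class MungerChain {
    private final List<UnaryOperator<String>> stages;

    private MungerChain(List<UnaryOperator<String>> stages) {
        this.stages = new ArrayList<>(stages);
    }

    @SafeVarargs
    static MungerChain of(UnaryOperator<String>... stages) {
        return new MungerChain(Arrays.asList(stages));
    }

    // Returns the fully normalized URL, or null if any stage rejected it.
    String apply(String url) {
        String current = url;
        for (UnaryOperator<String> stage : stages) {
            current = stage.apply(current);
            if (current == null) {
                return null; // later (possibly expensive) stages never run
            }
        }
        return current;
    }
}
```

A chain could then be assembled as, for example, a cheap suffix check first (`u -> u.endsWith(".gif") ? null : u`), a session-ID stripper second, and a filter on the normalized URL last, so the obvious garbage never reaches the costlier stages.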

On a related note, I was just starting to think about how to implement efficient
site-specific normalizations, and how to use them to handle site mirrors (already a large
number) as well as site-specific patterns (an increasing number) for things like session-ID removal.

> Flexible URL normalization
> --------------------------
>
>                 Key: NUTCH-365
>                 URL: http://issues.apache.org/jira/browse/NUTCH-365
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: patch.txt
>
>
> This patch is a heavily restructured version of the patch in NUTCH-253, so much that
> I decided to create a separate issue. It changes the URL normalization from a selectable single
> class to a flexible and context-aware chain of normalization filters.
> Highlights:
> * rename all *UrlNormalizer* to *URLNormalizer* for consistency.
> * use a "chained filter" pattern for running several normalizers in sequence
> * the order in which normalizers are executed is defined by "urlnormalizer.order" property,
>   which lists space-separated implementation classes. If there are more normalizers active than
>   explicitly named on this list, they will be run in random order after the ones specified on
>   the list are executed.
> * define a set of contexts (or scopes) in which normalizers may be called. Each scope
>   can have its own list of normalizers (via "urlnormalizer.scope.<scope_name>" property)
>   and its own order (via "urlnormalizer.order.<scope_name>" property). If any of these
>   properties are missing, default settings are used.
> * each normalizer may further select among many configurations, depending on the context
>   in which it is called, using a modified API:
>    URLNormalizer.normalize(String url, String scope);
> * if a config for a given scope is not defined, then the default config will be used.
> * several standard contexts / scopes have been defined, and various applications have
>   been modified to attempt using appropriate normalizer in their context.
> * all JUnit tests have been modified, and run successfully.
> NUTCH-363 suggests to me that further changes may be required in this area, perhaps we
> should combine urlfilters and urlnormalizers into a single subsystem of url munging - now
> that we have support for scopes and flexible combinations of normalizers we could turn URLFilters
> into a special case of normalizers (or vice versa, depending on the point of view) ...
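
For readers skimming the description above, the ordering rule can be sketched roughly as follows. This is my reading of the text, not code from the patch: NormalizerOrder and resolve are hypothetical names, plain java.util.Properties stands in for the Nutch configuration, and only the "urlnormalizer.order" and "urlnormalizer.order.<scope_name>" property names come from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper illustrating the ordering rule described in the issue.
final class NormalizerOrder {
    private NormalizerOrder() {}

    // Returns the active normalizer class names in execution order for a scope.
    static List<String> resolve(Properties conf, String scope, Set<String> active) {
        // A scope-specific order wins; otherwise fall back to the default order.
        String order = conf.getProperty("urlnormalizer.order." + scope);
        if (order == null) {
            order = conf.getProperty("urlnormalizer.order", "");
        }
        LinkedHashSet<String> result = new LinkedHashSet<>();
        for (String name : order.trim().split("\\s+")) {
            if (!name.isEmpty() && active.contains(name)) {
                result.add(name); // explicitly ordered normalizers come first
            }
        }
        // Active normalizers not named in the list run afterwards, in whatever
        // order the set yields them (the description leaves this unspecified).
        result.addAll(active);
        return new ArrayList<>(result);
    }
}
```

With a default order of "B A" and an order of "C A" for a scope called, say, "fetcher", three active normalizers A, B and C would run as C, A, B in that scope and as B, A, C in any scope without its own setting.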

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
