nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <>
Subject [jira] Updated: (NUTCH-603) Add more default url normalizations
Date Tue, 05 Feb 2008 17:04:08 GMT


Dennis Kubes updated NUTCH-603:

    Attachment: NUTCH-603-1-20080205.patch

Added normalizations that remove several kinds of session ids, rewrite default pages such as
index.html to /, strip #something in-page anchors, and clean up urls by collapsing repeated
ampersands and trimming trailing ?, ., or & characters.  Unit tests were added to show the
result of each expression.  All current expressions were tuned for performance.
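
As a rough illustration of the kinds of rules described above, here is a minimal Java sketch.
It is not the content of the attached patch: the class name UrlNormalizeSketch, the exact
regular expressions, the list of session-id parameter names (PHPSESSID, jsessionid, sid), and
the list of default page names are all assumptions made for this example.  Rules are applied
in order, each as a pattern/substitution pair.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of ordered regex url normalization; not the NUTCH-603 patch itself. */
public class UrlNormalizeSketch {

    /** One pattern/substitution pair, applied with replaceAll. */
    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            // Precompile once; matching is case-insensitive (e.g. PHPSESSID vs phpsessid).
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.substitution = substitution;
        }
    }

    // Order matters: later cleanup rules tidy what earlier removals leave behind.
    // All expressions below are illustrative assumptions.
    private static final List<Rule> RULES = Arrays.asList(
        // Remove #something in-page anchors.
        new Rule("#.*$", ""),
        // Remove a ;jsessionid=... path parameter.
        new Rule(";jsessionid=[^?&#]*", ""),
        // Remove session id query parameters, keeping the leading ? or &.
        new Rule("([?&])(?:phpsessid|jsessionid|sid)=[^&#]*&?", "$1"),
        // Rewrite default pages such as index.html to /.
        new Rule("/(?:index|default)\\.(?:html?|php|aspx?|jsp)(?=$|\\?)", "/"),
        // Collapse multiple ampersands into one.
        new Rule("&{2,}", "&"),
        // Drop an ampersand that directly follows the question mark.
        new Rule("\\?&", "?"),
        // Trim trailing ?, ., or & characters.
        new Rule("[?&.]+$", "")
    );

    /** Applies every rule in order and returns the normalized url. */
    public static String normalize(String url) {
        String result = url;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints http://example.com/
        System.out.println(normalize("http://example.com/index.html?PHPSESSID=abc123#top"));
        // prints http://example.com/a.jsp?x=1&y=2
        System.out.println(normalize("http://example.com/a.jsp;jsessionid=ABC?x=1&&y=2&"));
    }
}
```

Both example urls differ from their normalized forms only in session ids, anchors, default page
names, and stray separators, which is the kind of duplicate these rules are meant to collapse.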

> Add more default url normalizations
> -----------------------------------
>                 Key: NUTCH-603
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>         Attachments: NUTCH-603-1-20080205.patch
> By default the regex-urlnormalizers only remove PHPSESSID strings.  I propose adding
> more default url normalizers, including expressions for removing different types of session
> ids, removing default pages, removing in-page anchor links, and cleaning up url strings.  The
> point of these expressions is to decrease the number of duplicate urls that are stored and
> scored in the crawl database and then fetched.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
