nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <>
Subject [jira] Commented: (NUTCH-603) Add more default url normalizations
Date Tue, 12 Feb 2008 05:02:10 GMT


Dennis Kubes commented on NUTCH-603:

I am OK with commenting it out.  As long as it is there for people to use (instead of having
to create it themselves), I think that will be fine.  I will comment it out and, if there are
no objections, commit that version.

> Add more default url normalizations
> -----------------------------------
>                 Key: NUTCH-603
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>         Attachments: NUTCH-603-1-20080205.patch
> By default the regex-urlnormalizers only remove PHPSESSID strings.  I propose adding
> in more default url normalizers including expressions for removing different types of session
> ids, removing default pages, removing interpage links, and cleaning up url strings.  The point
> of these expressions is to decrease the number of duplicate urls that are being stored and
> scored in the crawl database and being fetched.
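
For readers unfamiliar with regex-based URL normalization, here is a minimal sketch of the kinds of rules the description lists. It is written in Python rather than in Nutch's own configuration format, and every pattern is an illustrative assumption, not a rule taken from NUTCH-603-1-20080205.patch. It also reads "interpage links" as in-page #fragment anchors, which is a guess. The names RULES and normalize are hypothetical.

```python
import re

# Hypothetical rules for illustration only; NOT the patterns from the
# NUTCH-603 patch. Each entry is (compiled pattern, replacement).
RULES = [
    # Drop session-id query parameters, keeping the leading '?' or '&'.
    (re.compile(r"([?&])(?:PHPSESSID|jsessionid|sid)=[^&#]*&?", re.IGNORECASE), r"\1"),
    # Drop ";jsessionid=..." path parameters.
    (re.compile(r";jsessionid=[^?#]*", re.IGNORECASE), ""),
    # Drop in-page fragment anchors such as "#top".
    (re.compile(r"#.*$"), ""),
    # Collapse default page names to the directory URL.
    (re.compile(r"/(?:index|default)\.(?:html?|php|asp|jsp)(?=$|\?)", re.IGNORECASE), "/"),
    # Clean up a dangling '?' or '&' left behind by the rules above.
    (re.compile(r"[?&]+$"), ""),
]


def normalize(url):
    """Apply each rule in order and return the rewritten URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


if __name__ == "__main__":
    for raw in (
        "http://example.com/index.html?PHPSESSID=abc123#top",
        "http://example.com/a.jsp;jsessionid=XYZ?x=1&sid=9&y=2",
        "http://example.com/docs/default.asp",
    ):
        print(raw, "->", normalize(raw))
```

With rules like these, several spellings of one page collapse to a single string, so the crawl database would hold one entry instead of many. A parameter name such as "sid" can carry real meaning on some sites, which is one reason a rule might ship commented out and be enabled per crawl.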

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
