nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-603) Add more default url normalizations
Date Tue, 12 Feb 2008 15:03:09 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-603:
-------------------------------

    Attachment: NUTCH-603-2-20080212.patch

This patch comments out the default page removal (e.g. index.html) and extends the session id
rules so that a leading _ character is removed along with the session id (e.g. _sessionid).
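
For illustration, the session id handling described above might look roughly like the sketch below. It is written in Python rather than as a regex-urlnormalizer rule, and the pattern, the parameter names it matches, and the helper name strip_session_id are assumptions made for this example, not text taken from the attached patch.

```python
import re

# Hypothetical pattern, not copied from the NUTCH-603 patch: matches a query
# parameter named sessionid or PHPSESSID, with an optional leading underscore,
# together with the separator that introduced it.
SESSION_ID = re.compile(
    r"([?&])_?(?:sessionid|phpsessid)=[^&#]*&?",
    re.IGNORECASE,
)


def strip_session_id(url: str) -> str:
    """Remove a session-id parameter, then tidy any dangling separator."""
    cleaned = SESSION_ID.sub(r"\1", url)
    # Drop a '?' or '&' left at the end of the URL or just before a fragment.
    return re.sub(r"[?&]+(?=#|$)", "", cleaned)


if __name__ == "__main__":
    for u in (
        "http://example.com/page?_sessionid=abc123&id=5",
        "http://example.com/page?id=5&sessionid=abc123",
        "http://example.com/page?PHPSESSID=xyz",
    ):
        print(u, "->", strip_session_id(u))
```

Requiring the ? or & separator directly before the optional underscore keeps the rule from touching unrelated parameters whose names merely end in "sessionid".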

> Add more default url normalizations
> -----------------------------------
>
>                 Key: NUTCH-603
>                 URL: https://issues.apache.org/jira/browse/NUTCH-603
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-603-1-20080205.patch, NUTCH-603-2-20080212.patch
>
>
> By default the regex-urlnormalizers only remove PHPSESSID strings.  I propose adding
> more default url normalizers, including expressions for removing different types of session
> ids, removing default pages, removing interpage links, and cleaning up url strings.  The point
> of these expressions is to decrease the number of duplicate urls that are stored and
> scored in the crawl database and then fetched.
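
To make the duplicate-reduction argument concrete, here is a minimal Python sketch of ordered pattern/substitution rules in the spirit of the description. The three rules, the list of default page names, and the function name normalize are illustrative assumptions; the expressions in the attached patches may differ, and the second patch comments the default page rule out.

```python
import re

# Illustrative rules only, applied in order; not the patterns from the patch.
RULES = [
    # Interpage links: drop the fragment, which never reaches the server.
    (re.compile(r"#.*$"), ""),
    # Default pages: treat /index.html and similar as the directory itself.
    (re.compile(r"/(?:index|default)\.(?:html?|php|asp)(?=\?|$)", re.IGNORECASE), "/"),
    # Cleanup: collapse repeated slashes, leaving the '//' after the scheme.
    (re.compile(r"(?<!:)/{2,}"), "/"),
]


def normalize(url: str) -> str:
    """Apply every rule in sequence and return the normalized URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


if __name__ == "__main__":
    urls = [
        "http://example.com/docs/index.html",
        "http://example.com/docs/#intro",
        "http://example.com//docs/",
    ]
    # Three spellings of the same page collapse to a single entry.
    print({normalize(u) for u in urls})
```

The default page rule is the riskiest of the three, because a server is not obliged to serve the same content for a directory and its index.html.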

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

