nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-603) Add more default url normalizations
Date Mon, 11 Feb 2008 20:16:08 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567797#action_12567797 ]

Dennis Kubes commented on NUTCH-603:
------------------------------------

If nobody has any objections I will go ahead and commit this tonight.

> Add more default url normalizations
> -----------------------------------
>
>                 Key: NUTCH-603
>                 URL: https://issues.apache.org/jira/browse/NUTCH-603
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-603-1-20080205.patch
>
>
> By default the regex-urlnormalizers only remove PHPSESSID strings.  I propose adding
> more default url normalizers, including expressions for removing different types of session
> ids, removing default pages, removing interpage links, and cleaning up url strings.  The point
> of these expressions is to decrease the number of duplicate urls that are stored and
> scored in the crawl database and then fetched.
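
To illustrate the kind of rule described above, here is a minimal sketch in Python of regex-based url normalization. It is not the contents of the attached patch and not Nutch's own configuration: the rule list, the `normalize` function, the parameter names matched (`phpsessid`, `jsessionid`, `sid`), and the default page names are all assumptions chosen for illustration. It also reads "interpage links" as `#fragment` anchors, which is an interpretation rather than something the issue states.

```python
import re

# Hypothetical rules for illustration only; NOT the rules from the NUTCH-603 patch.
# Each entry is (compiled pattern, replacement), applied in order.
RULES = [
    # Remove session-id query parameters, keeping the leading '?' or '&'.
    (re.compile(r"([?&])(?:phpsessid|jsessionid|sid)=[^&#]*&?", re.IGNORECASE), r"\1"),
    # Remove ';jsessionid=...' path parameters.
    (re.compile(r";jsessionid=[^?#]*", re.IGNORECASE), ""),
    # Remove fragment anchors (assumed meaning of "interpage links").
    (re.compile(r"#.*$"), ""),
    # Collapse default pages such as /index.html to the bare directory.
    (re.compile(r"/(?:index|default)\.(?:html?|php|aspx?|jsp)(?=$|\?)", re.IGNORECASE), "/"),
    # Clean up a dangling '?' or '&' left behind by earlier rules.
    (re.compile(r"[?&]+$"), ""),
]


def normalize(url):
    """Apply every rule in order and return the normalized url."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


if __name__ == "__main__":
    for u in (
        "http://example.com/index.html?PHPSESSID=abc123#top",
        "http://example.com/a.php?x=1&sid=42&y=2",
        "http://example.com/shop;jsessionid=ABC?item=3",
    ):
        print(u, "->", normalize(u))
```

Under these assumed rules, several spellings of the same page map to one string, which is how the duplicate entries in the crawl database would be avoided. Rule order matters in this sketch: the cleanup rule runs last so it can trim separators left over by the session-id rule.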

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

