nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-603) Add more default url normalizations
Date Mon, 11 Feb 2008 22:57:08 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567860#action_12567860 ]

Andrzej Bialecki  commented on NUTCH-603:
-----------------------------------------

I'm of two minds about one of these rules: the one that strips /index.html and similar.
I know of a few sites where /index.html != /index.php, I even remember creating one like that
:) Some sites redirect / not to /index.html but to somewhere further down in the hierarchy, and
they don't have a proper /index.html at all. In other words, I vote for removing this rule, or
at least commenting it out.

Other than that, +1.
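
To illustrate the concern about the default-page rule, here is a minimal Python sketch.
The pattern and the strip_default_page helper are hypothetical and are not taken from the
NUTCH-603 patch; Nutch itself applies Java regexes from its normalizer configuration. The
sketch only shows how such a rule maps two URLs that may serve different content onto one.

```python
import re

# Hypothetical rule (an assumption, not the pattern from the patch):
# strip a trailing default page name such as /index.html or /index.php.
DEFAULT_PAGE = re.compile(r"/(?:index|default)\.(?:html?|php|asp|jsp)$", re.IGNORECASE)


def strip_default_page(url: str) -> str:
    """Replace a trailing default page name with a bare slash."""
    return DEFAULT_PAGE.sub("/", url)


a = strip_default_page("http://example.com/index.html")
b = strip_default_page("http://example.com/index.php")

# Both URLs normalize to the same string, so a crawler would treat them
# as one page even if the site serves different content for each.
print(a, b, a == b)
```

If a site really does serve different pages at /index.html and /index.php, one of them
would never be fetched under a rule like this.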

> Add more default url normalizations
> -----------------------------------
>
>                 Key: NUTCH-603
>                 URL: https://issues.apache.org/jira/browse/NUTCH-603
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-603-1-20080205.patch
>
>
> By default the regex-urlnormalizers only remove PHPSESSID strings.  I propose adding
> in more default url normalizers including expressions for removing different types of session
> ids, removing default pages, removing interpage links, and cleaning up url strings.  The point
> of these expressions is to decrease the number of duplicate urls that are being stored and
> scored in the crawl database and being fetched.
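
As a rough sketch of the kind of normalization described above, the Python below removes a
PHPSESSID parameter and a trailing fragment, then tidies leftover separators. The patterns
and the normalize helper are assumptions made for illustration; they are not the expressions
in the attached patch, and reading "interpage links" as #fragment anchors is also a guess.

```python
import re

# All three patterns are hypothetical illustrations, not Nutch's own rules.
FRAGMENT = re.compile(r"#.*$")                                    # "#section" anchors
SESSION_ID = re.compile(r"([?&])PHPSESSID=[^&#]*&?", re.IGNORECASE)
TRAILING = re.compile(r"[?&]+$")                                  # leftover "?" or "&"


def normalize(url: str) -> str:
    """Apply the sketch rules in order: fragment, session id, cleanup."""
    url = FRAGMENT.sub("", url)
    url = SESSION_ID.sub(r"\1", url)   # keep the "?" or "&" that preceded the id
    url = TRAILING.sub("", url)
    return url


urls = [
    "http://example.com/a.php?PHPSESSID=abc123&x=1#top",
    "http://example.com/a.php?x=1&PHPSESSID=def456",
    "http://example.com/a.php?x=1",
]

# The three variants collapse into a single crawl database entry.
unique = {normalize(u) for u in urls}
print(unique)
```

Fewer distinct strings for the same page means fewer entries to store, score, and fetch,
which is the stated goal of the proposal.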

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

