nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash
Date Mon, 17 Mar 2008 13:23:24 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579438#action_12579438 ]

Andrzej Bialecki  commented on NUTCH-620:
-----------------------------------------

It would be interesting to see the source HTML that causes these links to appear. I think
your point is valid: Nutch should collapse such adjacent slashes. Could you provide a patch
to BasicURLNormalizer that implements this rule?

> BasicURLNormalizer should collapse runs of slashes with a single slash
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-620
>                 URL: https://issues.apache.org/jira/browse/NUTCH-620
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: JDK 1.6 update 5, Tomcat 6, Windows Server 2003, 
>            Reporter: Mark DeSpain
>            Priority: Minor
>             Fix For: 1.0.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The BasicURLNormalizer should collapse each run of slash characters '/' into a single slash.
>
> For example, the following URLs should be normalized to http://lucene.apache.org/nutch/about.html
> * http://lucene.apache.org/nutch//about.html 
> * http://lucene.apache.org//nutch/about.html 
> * http://lucene.apache.org/////nutch////about.html (an exaggerated example)
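
The requested rule could be sketched roughly as follows. This is only an illustration, not the patch to BasicURLNormalizer: the class name SlashCollapseSketch and the method collapseSlashes are hypothetical, and the sketch assumes that only the path component should be collapsed, so the "//" after the scheme and any slashes in the query or fragment are left alone.

```java
import java.net.URI;

// Hypothetical stand-alone sketch; not part of Nutch or BasicURLNormalizer.
final class SlashCollapseSketch {

    // Collapses each run of '/' in the path of a hierarchical URL into one '/'.
    // Assumption: scheme, authority, query, and fragment must stay untouched.
    public static String collapseSlashes(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on bad syntax
        String path = uri.getRawPath();
        // Leave anything that is not a scheme://authority/path style URL as-is.
        if (uri.isOpaque() || uri.getScheme() == null
                || uri.getRawAuthority() == null || path == null) {
            return url;
        }
        StringBuilder out = new StringBuilder();
        out.append(uri.getScheme()).append("://").append(uri.getRawAuthority());
        // Two or more consecutive slashes become a single slash.
        out.append(path.replaceAll("/{2,}", "/"));
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The three example URLs from the issue description.
        String[] examples = {
            "http://lucene.apache.org/nutch//about.html",
            "http://lucene.apache.org//nutch/about.html",
            "http://lucene.apache.org/////nutch////about.html"
        };
        for (String example : examples) {
            System.out.println(example + " -> " + collapseSlashes(example));
        }
    }
}
```

Parsing the URL first, rather than running a regular expression over the whole string, keeps the scheme separator and any URL embedded in a query parameter out of reach of the replacement.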

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

