nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex
Date Wed, 22 Apr 2015 21:08:59 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-1062.
------------------------------------
       Resolution: Fixed
    Fix Version/s:     (was: 1.11)
                   1.10
                   2.4

Resolved by NUTCH-1990.

> Migrate BasicURLNormalizer from Apache ORO to java.util.regex
> -------------------------------------------------------------
>
>                 Key: NUTCH-1062
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1062
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 2.4, 1.10
>
>
> Issue for the migration from ORO to j.u.regex. There is a small problem here. I began the
> migration mostly because of the double-slash issue, whose fix needs a lookbehind, which is not
> supported in ORO. The lookbehind was meant to prevent the "//" after the URL scheme from being
> reduced to one slash. The current BasicURLNormalizer has this problem built in!
> {code}
>         // this pattern tries to find spots like "xx//yy" in the url,
>         // which could be replaced by a "/"
>         adjacentSlashRule = new Rule();
>         adjacentSlashRule.pattern = (Perl5Pattern)      
>           compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);     
>         adjacentSlashRule.substitution = new Perl5Substitution("/");
> {code}
> But this rule provides the wrong solution, as it touches the "//" after the scheme as well.
> What to do? Migrate to j.u.regex and keep this `feature` intact?
> Edit: reading further, it looks like this is fixed at a later stage: a slash is added back
> for the URI schemes http and ftp.
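
For illustration only, here is a minimal sketch of how the adjacent-slash rule might look in java.util.regex with a negative lookbehind. It is not the code committed for NUTCH-1990. The class name `AdjacentSlashSketch` and the methods `collapseNaive` and `collapseSlashes` are hypothetical, and the assumption that a slash run directly after ":" should be left alone is a simplification (it also spares any "://" inside a query string).

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not taken from the Nutch code base.
class AdjacentSlashSketch {

    // Same pattern as the ORO rule quoted above: also matches the "//" after the scheme.
    private static final Pattern NAIVE = Pattern.compile("/{2,}");

    // Negative lookbehind: a run of slashes may not start directly after ':'.
    private static final Pattern ADJACENT_SLASHES = Pattern.compile("(?<!:)/{2,}");

    static String collapseNaive(String url) {
        return NAIVE.matcher(url).replaceAll("/");
    }

    static String collapseSlashes(String url) {
        return ADJACENT_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String url = "http://example.com//a///b";
        System.out.println(collapseNaive(url));   // scheme separator is reduced to one slash
        System.out.println(collapseSlashes(url)); // scheme separator is kept
    }
}
```

With the lookbehind, no later step is needed to add the slash back for http and ftp, which is the workaround the edit note above describes for the ORO-based rule.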



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
