nutch-dev mailing list archives

From "Ferdy Galema (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1098) better url-normalizer basic
Date Fri, 04 Nov 2011 16:33:51 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144132#comment-13144132 ]

Ferdy Galema commented on NUTCH-1098:
-------------------------------------

Like I said before, I'm up for converting spaces to %20, so that at least the fetcher will
not fail. I also think converting lowercase escapes to uppercase is a good idea. (Though I'm
not completely sure whether the two forms are fully interchangeable.)
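
To make those two changes concrete, here is a minimal Java sketch of what they could look like in isolation. It is not taken from the attached patch; the class and method names (EscapeNormalizerSketch, normalizeEscapes) are hypothetical, and it assumes that only literal space characters need encoding and that every "%xx" sequence in the input is already a valid escape.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or of the patch under review.
class EscapeNormalizerSketch {
    // Matches one percent-escape such as "%3a" or "%3A".
    private static final Pattern ESCAPE = Pattern.compile("%([0-9A-Fa-f]{2})");

    static String normalizeEscapes(String url) {
        // Step 1: encode literal spaces so HTTP clients do not choke on them.
        String spaced = url.replace(" ", "%20");

        // Step 2: uppercase the two hex digits of every existing escape.
        // Nothing is decoded, so the bytes the URL represents stay the same.
        Matcher m = ESCAPE.matcher(spaced);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, "%" + m.group(1).toUpperCase(Locale.ROOT));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://example.com/a%20b/%E2%82%AC?q=%3A
        System.out.println(normalizeEscapes("http://example.com/a b/%e2%82%ac?q=%3a"));
    }
}
```

Because neither step decodes anything, this variant avoids the decode-then-encode round trip entirely.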

However, I cannot commit this using the latest patch, because it is clearly intertwined with
other changes. Why is the URL decoded before spaces are changed into %20, after which it is
encoded again? Why "Remove % encoding from URL in range 0x20-0x80 exclusive / and # are not decoded"
and why "// this pattern tries to find spots like "%34" in the url"?
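
For comparison, a narrower rule than the 0x20-0x80 range would be to decode only escapes of characters that RFC 3986 calls "unreserved" (letters, digits, "-", ".", "_", "~"), since those are the ones for which the escaped and literal forms are equivalent. The following is only a rough sketch of that idea, not the patch's code; UnreservedDecodeSketch and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or of the patch under review.
class UnreservedDecodeSketch {
    // Matches one percent-escape such as "%34".
    private static final Pattern ESCAPE = Pattern.compile("%([0-9A-Fa-f]{2})");

    // RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~".
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9') || "-._~".indexOf(c) >= 0;
    }

    static String decodeUnreserved(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1), 16);
            // Decode only unreserved characters; keep every other escape
            // (space, "/", "#", non-ASCII bytes, ...) exactly as it was.
            String replacement = isUnreserved(value)
                ? String.valueOf((char) value)
                : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://example.com/42?u=a%2Fb%23c%20d
        System.out.println(decodeUnreserved("http://example.com/%34%32?u=a%2Fb%23c%20d"));
    }
}
```

With this rule "%34" becomes "4", while "%2F", "%23" and "%20" are left untouched, so no re-encoding step is needed afterwards.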

I think changes to default filtering/normalizing should be very well thought out, because,
as Markus said, the impact can potentially be very big. In short, I'm not able to commit
anything as of now. (But if anyone else is, don't let me stop you.) Just my 2 cents.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
