nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer
Date Mon, 20 Apr 2015 21:47:58 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1990:
-----------------------------------
    Attachment: NUTCH-1990-v1.patch

Uuuh, a lot of garbage :(  I've also run the test after giving BasicURLNormalizer a main()
method:
* found another bug in the current version: "http://107jamz.com/registration/?referer=http://107jamz.com"
loses the double slash in the query part. That's because the slash and dot segment
normalization is currently run on the part returned by url.getFile(), which includes the query.
It should be run only on the part returned by url.getPath(). That's fixed by the new version.
* the trial is 50% slower using Julien's test set. But that's expected because only a small
fraction of the URLs contain paths with dot segments or double slashes.
* but once a check is added to avoid needless work, it's as fast as before (maybe slightly
faster): 0:49.78 (before), 1:03.11 (trial), 0:45.49 (patch v1)
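
The two points above (getFile() vs. getPath(), and skipping URLs that need no work) can be
illustrated with a minimal sketch. It is not the code from NUTCH-1990-v1.patch, which is not
shown in this message: the class name, the method names, and the particular pre-check
(looking for "//" or "/." in the path) are assumptions made for illustration, and
example.com is a placeholder host. The sketch only exercises dot segments.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
final class PathNormalizationSketch {

    /** Returns { getFile(), getPath() } for an absolute URL string. */
    static String[] fileAndPath(String urlString) {
        try {
            URL url = URI.create(urlString).toURL();
            // getFile() is path plus query; getPath() is the path alone.
            return new String[] { url.getFile(), url.getPath() };
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /**
     * Assumed cheap pre-check: a path without "//" or "/." cannot contain
     * an empty segment or a dot segment, so it can be left alone.
     * False positives (e.g. "/.hidden") only cost an extra normalize() call.
     */
    static boolean mayNeedNormalization(String path) {
        return path.contains("//") || path.contains("/.");
    }

    /** Normalizes the path only, and only when the pre-check fires. */
    static String normalize(String urlString) {
        URI uri = URI.create(urlString);
        String path = uri.getRawPath();
        if (path == null || !mayNeedNormalization(path)) {
            return urlString; // fast path: nothing to do
        }
        // URI.normalize() rewrites the path component; the query is kept.
        return uri.normalize().toString();
    }

    public static void main(String[] args) {
        String u = "http://107jamz.com/registration/?referer=http://107jamz.com";
        String[] parts = fileAndPath(u);
        System.out.println(parts[0]); // /registration/?referer=http://107jamz.com
        System.out.println(parts[1]); // /registration/
        System.out.println(normalize(u)); // unchanged, query keeps its "//"
        System.out.println(normalize("http://example.com/a/./b/../c?x=http://y"));
        // http://example.com/a/c?x=http://y
    }
}
```

Because the check looks only at the path, a "//" inside the query never triggers
normalization, and most URLs return on the fast path without constructing a new URI.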


> Use URI.normalise() in BasicURLNormalizer
> -----------------------------------------
>
>                 Key: NUTCH-1990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1990
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-1990-trial1.patch, NUTCH-1990-v1.patch
>
>
> One of the things that [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
> does is to remove unnecessary dot segments in the path.
> Instead of implementing the logic ourselves with some antiquated regex library, we should
> simply use [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()] which
> does the same and is probably more efficient.
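
For reference, a minimal sketch of what java.net.URI.normalize() does with dot segments.
The class and method names are hypothetical, and example.com is a placeholder; this is
not the BasicURLNormalizer code.

```java
import java.net.URI;

// Hypothetical demo class, not part of Nutch.
final class DotSegmentDemo {

    /** Removes "." and ".." segments from the path of the given URI string. */
    static String removeDotSegments(String uriString) {
        return URI.create(uriString).normalize().toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDotSegments("http://example.com/a/./b/../c.html"));
        // http://example.com/a/c.html
    }
}
```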



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
