nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Updated: (NUTCH-505) Outlink urls should be validated
Date Thu, 12 Jul 2007 12:17:04 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505-v2.patch

After my last commit, I read that Sun's java.util.regex implementation is actually faster
than jakarta-oro, so I changed UrlValidator to use java.util.regex instead of jakarta-oro.
I ran some simple tests, and java.util.regex does seem to be faster. I also added a
basic optimization to ParseOutputFormat: the ArrayLists are now created with initialCapacity
arguments, which reduces the number of allocations.
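
To illustrate the two changes described above, here is a minimal sketch, not the code in the patch. The class name OutlinkFilterSketch, its methods, and the URL pattern are all hypothetical and deliberately simplified; the real UrlValidator rules are stricter and more complete. It only shows the general shape: a java.util.regex Pattern compiled once and reused, and an ArrayList created with an initial capacity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and pattern are assumptions, not taken from NUTCH-505.
public class OutlinkFilterSketch {

    // Compiled once and reused for every URL. This simplified pattern accepts
    // http/https/ftp, a plain host, an optional port, and a path made only of
    // characters that are legal in a URL path.
    private static final Pattern SIMPLE_URL = Pattern.compile(
        "^(?:https?|ftp)://[A-Za-z0-9.-]+(?::\\d+)?"
        + "(?:/[A-Za-z0-9._~!$&'()*+,;=:@%/-]*)?"
        + "(?:\\?[^\\s#]*)?(?:#\\S*)?$");

    static boolean isValid(String url) {
        return url != null && SIMPLE_URL.matcher(url).matches();
    }

    // Keeps only the outlinks that pass validation.
    static List<String> filter(String[] outlinks) {
        // The list can never hold more than outlinks.length entries, so
        // sizing it up front avoids repeated growth of the backing array.
        List<String> kept = new ArrayList<String>(outlinks.length);
        for (String url : outlinks) {
            if (isValid(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

With this pattern, a garbage outlink such as http://www.variety.com/</div></a> is rejected because '<' is not a legal path character.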

Is it necessary to reopen this issue or open another issue for this? I think this one is simple
enough to commit without opening a separate issue, but feel free to disagree.

Also, I realized that UrlValidator considers http://www.iiit.net/images/CCCCCC_line_br[1].gif
invalid, even though Firefox will display the gif (Firefox escapes the path and then fetches the
escaped url). This doesn't seem to be a problem right now since Nutch can't fetch these urls
anyway, but we may consider adding some sort of smart escaping later.
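
As a rough idea of what such escaping could look like, here is a sketch that relies on the multi-argument java.net.URI constructor, which quotes characters that are illegal in a path. The class PathEscapeSketch and its method are hypothetical and not part of the patch. The sketch is not "smart": it only handles scheme://host/path urls (no port, user info, query, or fragment), and it would double-escape a path that is already escaped, because that constructor always quotes '%'.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of NUTCH-505.
public class PathEscapeSketch {

    // Splits a url into scheme, host, and path. Urls with a port, user info,
    // query, or fragment do not match and are rejected by this sketch.
    private static final Pattern PARTS = Pattern.compile(
        "^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#:@]+)([^?#]*)$");

    static String escapePath(String rawUrl) {
        Matcher m = PARTS.matcher(rawUrl);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported url shape: " + rawUrl);
        }
        try {
            // URI(scheme, host, path, fragment) quotes illegal path characters
            // such as '[' and ']' instead of rejecting them.
            return new URI(m.group(1), m.group(2), m.group(3), null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For the url above, escapePath returns http://www.iiit.net/images/CCCCCC_line_br%5B1%5D.gif, which a validator that accepts percent-escapes could then pass.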

> Outlink urls should be validated
> --------------------------------
>
>                 Key: NUTCH-505
>                 URL: https://issues.apache.org/jira/browse/NUTCH-505
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-505-v2.patch, NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage URLs from pages. We need a URL validation system that
> tests these URLs and filters out the garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

