nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
Date Fri, 06 Mar 2009 12:53:56 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679582#action_12679582 ]

Andrzej Bialecki  commented on NUTCH-712:
-----------------------------------------

I'm not sure that ignoring this exception is the right thing to do: if we fail to normalize
the URL, we also fail to filter it. This means that if we proceed as if nothing happened (which
your patch does), we could end up with many unfiltered junk URLs.

I think a better alternative is to return, i.e. to skip this record without further processing.
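
To illustrate the difference, here is a minimal sketch of the "skip on failure" behaviour. It is not the actual ParseOutputFormat code: the class OutlinkSkipSketch, the Normalizer and Filter interfaces and the keepOutlinks method are hypothetical stand-ins, and treating each outlink as the unit to skip (continue rather than return) is an assumption made for the example.

```java
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

public class OutlinkSkipSketch {

    /** Hypothetical stand-in for a URL normalizer; not the Nutch API. */
    interface Normalizer {
        String normalize(String url) throws MalformedURLException;
    }

    /** Hypothetical stand-in for a URL filter; returns null to reject a URL. */
    interface Filter {
        String filter(String url);
    }

    /**
     * Normalizes and filters each outlink. An outlink that cannot be
     * normalized is skipped entirely, so it never bypasses the filter.
     */
    static List<String> keepOutlinks(List<String> outlinks,
                                     Normalizer normalizer,
                                     Filter filter) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String normalized;
            try {
                normalized = normalizer.normalize(link);
            } catch (MalformedURLException e) {
                // Skip this outlink: continuing with the raw string would
                // let an unnormalized, unfiltered URL through.
                continue;
            }
            String filtered = filter.filter(normalized);
            if (filtered != null) {
                kept.add(filtered);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy normalizer (assumption): rejects strings without a scheme.
        Normalizer normalizer = url -> {
            if (!url.contains("://")) {
                throw new MalformedURLException("no protocol: " + url);
            }
            return url.toLowerCase();
        };
        // Toy filter (assumption): accepts only http URLs.
        Filter filter = url -> url.startsWith("http://") ? url : null;

        List<String> links = List.of(
                "http://example.com/a", "not a url", "ftp://example.com/b");
        System.out.println(keepOutlinks(links, normalizer, filter));
    }
}
```

With an empty catch block instead of the continue, the malformed string would have to be either dropped anyway or passed on without having gone through normalization, which is the unfiltered-junk case described above.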

> ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-712
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>         Attachments: ParseOutputFormat-NUTCH712.patch
>
>
> ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
> otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

