nutch-dev mailing list archives

From "Chirag Chaman"
Subject Bad URLs causing SEVERE exception
Date Tue, 05 Jul 2005 20:47:34 GMT

Over the weekend the fetcher crashed and kept crashing. The culprit was a
site that pointed to malformed links -- http://:80/ and http://:0/ etc.

These links were getting through, so we changed the URL filter to accept
only valid URLs.

Since someone else may face the same issue, here is the regex -- it should go
toward the end of your regex-urlfilter.txt. It would be nice if one of
the committers could add this to the default file, commented out.

# accept http only - valid URLs only

NOTE: This is only good for Web crawling. If you need intranet crawling, do
not use this, as it will not let any URL through that lacks at least one period
in the hostname.
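The regex line itself appears to have been lost in this archived copy (only the comment line above survives). As a sketch of the idea described -- accept only http URLs whose host contains at least one period, so hostless links like http://:80/ are rejected -- a pattern of roughly this shape would work; the exact regex below is an assumption, not the author's original:

```python
import re

# Hypothetical reconstruction of the filter rule described in the post:
# require an http(s) scheme and a host containing at least one period,
# so malformed links like http://:80/ and http://:0/ are rejected.
VALID_HTTP_URL = re.compile(r'^https?://[A-Za-z0-9.-]+\.[A-Za-z]{2,}(:\d+)?(/.*)?$')

def accepts(url: str) -> bool:
    """Return True if the URL would pass the filter."""
    return VALID_HTTP_URL.match(url) is not None

print(accepts("http://lucene.apache.org/nutch/"))  # host has periods, passes
print(accepts("http://:80/"))                      # empty host, rejected
print(accepts("http://localhost/"))                # no period, rejected (intranet case)
```

Note that, as the warning above says, this also rejects single-label intranet hosts such as http://localhost/, since they contain no period.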

Filangy, Inc.
