nutch-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 21:49:59 GMT
>Suggestion:
>For consistency and ease of Nutch management, why not filter the
>extensions based on the activated plugins?
>By looking at the mime-types defined in the parse-plugins.xml file and the
>activated plugins, we know which content-types will be parsed.
>So, by getting the file extensions associated with each content-type, we can
>build a list of file extensions to include (other ones will be excluded) in
>the fetch process.

I'd asked a Nutch consultant this exact same question a few months ago.

It does seem odd that there's an implicit dependency between the file 
suffixes found in regex-urlfilter.txt and the enabled plug-ins found 
in nutch-default.xml and nutch-site.xml. What's the point of 
downloading a 100MB .bz2 file if there's nobody available to handle 
it?
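
To make that dependency concrete, here is a rough sketch (in Python, not actual Nutch code) of deriving the fetchable suffixes from the content types that the enabled parsers handle. The content types, the type-to-suffix table, and all function names are made up for illustration; in Nutch this information would have to come from parse-plugins.xml and the plugin settings in nutch-default.xml and nutch-site.xml.

```python
import posixpath
from urllib.parse import urlparse

# ASSUMPTION: content types handled by the enabled parse plugins.
# Hypothetical values, not read from any real Nutch configuration.
ENABLED_CONTENT_TYPES = {"text/html", "application/pdf"}

# ASSUMPTION: a hand-written content-type -> suffix table for illustration.
EXTENSIONS_BY_TYPE = {
    "text/html": {"html", "htm"},
    "application/pdf": {"pdf"},
    "application/x-bzip2": {"bz2"},
}


def parseable_extensions(enabled_types, extensions_by_type):
    """Collect the suffixes of every content type that some parser handles."""
    allowed = set()
    for content_type in enabled_types:
        allowed |= extensions_by_type.get(content_type, set())
    return allowed


def url_extension(url):
    """Return the lower-cased suffix of the URL path, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1][1:].lower()


def should_fetch(url, allowed):
    """Fetch URLs with no suffix, or with a suffix some parser can handle."""
    ext = url_extension(url)
    return ext == "" or ext in allowed


allowed = parseable_extensions(ENABLED_CONTENT_TYPES, EXTENSIONS_BY_TYPE)
```

With only the HTML and PDF parsers enabled, should_fetch("http://example.com/dump.tar.bz2", allowed) is False, so the 100MB .bz2 file would never be downloaded. URLs without a suffix are let through here because nothing can be inferred from them.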

It's also odd that there's a nutch-site.xml, but no equivalent for 
regex-urlfilter.txt.

There are cases where a suffix (like .php) can return content of any 
mime-type, and other suffixes (like .xml) can mean any number of 
things. So I think you'd still want regex-urlfilter.txt files (both a 
default and a site version) that provide explicit additions to and 
deletions from the list generated from the installed and enabled 
parse plugins.
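
One way the default and site versions could layer on top of the generated list is sketched below in Python. The "+suffix" / "-suffix" line format is an assumption invented for this example and is not the real regex-urlfilter.txt syntax (which takes regular expressions); the function name and sample data are hypothetical as well.

```python
def apply_overrides(extensions, override_lines):
    """Apply explicit additions (+ext) and deletions (-ext) to a suffix set.

    ASSUMPTION: one override per line, "+ext" or "-ext"; blank lines and
    lines starting with "#" are ignored. This format is hypothetical.
    """
    result = set(extensions)
    for raw in override_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, ext = line[0], line[1:].strip().lower()
        if sign == "+":
            result.add(ext)
        elif sign == "-":
            result.discard(ext)
        else:
            raise ValueError("expected '+ext' or '-ext', got: " + line)
    return result


# Hypothetical list generated from the enabled parse plugins.
derived = {"html", "htm", "xml", "pdf"}

# Hypothetical default overrides: .php usually serves HTML, .xml is ambiguous.
default_overrides = ["# shipped defaults", "+php", "-xml"]

# Hypothetical site overrides: this site does want .xml after all.
site_overrides = ["+xml"]

# Site overrides are applied last, so they win over the defaults.
final = apply_overrides(apply_overrides(derived, default_overrides),
                        site_overrides)
```

Applying the site lines after the default lines mirrors how nutch-site.xml takes precedence over nutch-default.xml.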

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200
