nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 18:43:05 GMT
Ken Krugler wrote:
> For what it's worth, below is the filter list we're using for doing an 
> html-centric crawl (no word docs, for example). Using the (?i) means we 
> don't need to have upper & lower-case versions of the suffixes.
> 
> -(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$


This looks like a more complete suffix list.

Should we use this as the default?  By default only html and text 
parsers are enabled, so perhaps that's all we should accept.

Why do you exclude .php URLs?  These are simply dynamic pages, no?
Similarly, .jsp and .py are frequently the suffixes of pages that
return HTML.  Are there other suffixes we should remove from this list
before we make it the default exclusion list?

Doug
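
To make the question about .php, .jsp and .py concrete, here is a minimal sketch of how the quoted rule behaves. It is written in Python for brevity, whereas Nutch applies such rules with Java regular expressions; the constructs used here ((?i), alternation, \)?, $) should behave the same way in both. The names SUFFIXES, DYNAMIC, build_filter and is_excluded, and the example.com URLs, are hypothetical and not part of Nutch. The leading "-" of the rule is dropped because it marks the line as an exclusion and is not part of the pattern.

```python
import re

# Suffix list copied from the quoted rule.
SUFFIXES = (
    "ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|"
    "gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|"
    "msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|"
    "tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip"
).split("|")

# Suffixes questioned in the reply because they often serve HTML.
DYNAMIC = {"php", "jsp", "py"}


def build_filter(suffixes):
    """Compile a case-insensitive pattern matching URLs ending in a listed suffix."""
    return re.compile(r"(?i)\.(" + "|".join(suffixes) + r")\)?$")


def is_excluded(pattern, url):
    """Return True if the exclusion pattern matches anywhere in the URL."""
    return pattern.search(url) is not None


# The rule as quoted, and a hypothetical variant with the dynamic suffixes removed.
quoted = build_filter(SUFFIXES)
relaxed = build_filter([s for s in SUFFIXES if s not in DYNAMIC])

if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/photo.JPG",
        "http://example.com/index.php",
        "http://example.com/page.php?id=3",
    ):
        print(url, is_excluded(quoted, url), is_excluded(relaxed, url))
```

Two things follow from the pattern itself. Because of (?i), photo.JPG is excluded without a separate upper-case entry. Because the pattern is anchored with $, a URL such as page.php?id=3 is not excluded even by the quoted list, since the suffix is not at the end of the string.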
