nutch-dev mailing list archives

From "Howie Wang" <howie_w...@hotmail.com>
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 18:53:20 GMT
.pl files are often just Perl CGI scripts, so they usually return HTML too.
And .xhtml seems like it would be parsable by the default HTML parser.

Howie

>From: Doug Cutting <cutting@nutch.org>
>
>Ken Krugler wrote:
>>For what it's worth, below is the filter list we're using for doing an 
>>html-centric crawl (no word docs, for example). Using the (?i) means we 
>>don't need to have upper & lower-case versions of the suffixes.
>>
>>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
>
>This looks like a more complete suffix list.
>
>Should we use this as the default?  By default only html and text parsers 
>are enabled, so perhaps that's all we should accept.
>
>Why do you exclude .php urls?  These are simply dynamic pages, no? 
>Similarly, .jsp and .py are frequently suffixes that return html.  Are 
>there other suffixes we should remove from this list before we make it the 
>default exclusion list?
>
>Doug
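
To make the effect of the quoted rule concrete, here is a minimal sketch of how such a suffix rule behaves. It makes several assumptions: the leading "-" is read as the filter file's "exclude" sign rather than as part of the regular expression, the rule is applied with find-style matching, and the suffix list is a shortened, hypothetical subset of the one quoted above. The class and method names are invented for illustration and are not Nutch APIs.

```java
import java.util.regex.Pattern;

public class SuffixFilterSketch {
    // Hypothetical subset of the quoted suffix list, kept short for readability.
    // The leading "-" of the rule is dropped: it is assumed to be the
    // "exclude" sign of the filter file, not part of the regex itself.
    static final Pattern QUOTED_STYLE =
        Pattern.compile("(?i)\\.(gif|jpg|pdf|php|pl|zip)\\)?$");

    // Same subset with the dynamic-page suffixes (php, pl) removed,
    // along the lines the replies in this thread suggest.
    static final Pattern TRIMMED =
        Pattern.compile("(?i)\\.(gif|jpg|pdf|zip)\\)?$");

    // Returns true when the URL would be excluded, assuming find-style matching.
    static boolean isExcluded(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/logo.GIF",      // upper-case suffix, caught via (?i)
            "http://example.com/index.html",    // not in the list
            "http://example.com/page.php",      // dynamic page
            "http://example.com/page.php?id=1"  // suffix is not at the end of the URL
        };
        for (String url : urls) {
            System.out.println(url
                + " quoted=" + isExcluded(QUOTED_STYLE, url)
                + " trimmed=" + isExcluded(TRIMMED, url));
        }
    }
}
```

Because the pattern is anchored with $, a URL that carries a query string after the suffix is not matched by either variant, so under these assumptions the exclusion only affects dynamic pages whose URL actually ends in the suffix.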


