nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jérôme Charron <jerome.char...@gmail.com>
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 21:29:13 GMT
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Yes, the fetcher can't rely on the document mime-type.
The only thing we can use for filtering is the document's URL.
So, another alternative, could be to exclude only files extensions that are
registered in the mime-type repository
(some well known file extensions) but for which no parser is activated. And
accepting all other ones.
So that the .foo files will be fetched...

Jérôme

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message