nutch-dev mailing list archives

From "Chris Mattmann" <chris.mattm...@jpl.nasa.gov>
Subject RE: Urlfilter Patch
Date Thu, 01 Dec 2005 22:06:09 GMT
Hi Jerome,

> Yes, the fetcher can't rely on the document mime-type.
> The only thing we can use for filtering is the document's URL.
> So, another alternative could be to exclude only file extensions that are
> registered in the mime-type repository (some well known file extensions)
> but for which no parser is activated, and to accept all other ones.
> So that the .foo files will be fetched...

Yup, the key phrase is "well known". It would sort of be an optimization, or
heuristic, to save some work on the regex...
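
To make the idea concrete, here is a rough sketch of the heuristic as I read it. It is only an illustration, not Nutch code: the extension table, the set of parseable types, and the function names are all made up for the example, and a real version would read both from the mime-type repository and the plugin configuration.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical stand-in for the mime-type repository:
# well-known file extension -> mime type.
KNOWN_EXTENSIONS = {
    "html": "text/html",
    "pdf": "application/pdf",
    "zip": "application/zip",
    "gif": "image/gif",
}

# Hypothetical set of mime types for which a parser is activated.
PARSEABLE_TYPES = {"text/html", "application/pdf"}


def extension_of(url):
    """Return the lower-cased extension of the URL path, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1][1:].lower()


def should_fetch(url):
    """Reject only well-known extensions that have no activated parser."""
    mime = KNOWN_EXTENSIONS.get(extension_of(url))
    if mime is None:
        # Unknown or missing extension (e.g. .foo): accept and let the
        # fetcher find out the real content type later.
        return True
    return mime in PARSEABLE_TYPES
```

With these made-up tables, `should_fetch("http://example.com/archive.zip")` is rejected because .zip is well known but has no parser, while a `.foo` URL is accepted because nothing is known about it.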

Cheers,
  Chris


> 
> Jérôme

