nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 21:16:04 GMT
Hi Doug,


On 12/1/05 1:11 PM, "Doug Cutting" <cutting@nutch.org> wrote:

> Jérôme Charron wrote:
[...]
> 
> What about a site that develops a content system that has urls that end
> in .foo, which we would exclude, even though they return html?
> 
> Doug

  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
the URL ends in .foo, .bar or the like. I'm not sure if the mime type
registry is there yet, but I know that Jérôme was working on a major update
that would help in recognizing these types of situations. Of course,
efficiency comes into play as well, in terms of not slowing down the
fetch/parse, but it would be nice to have a general solution that made use
of the information available in parse-plugins.xml to determine the
appropriate set of allowed extensions in a URLFilter, if possible. It may be
a pipe dream, but I'd say it's worth exploring...
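
To make the idea a bit more concrete, here is a rough sketch in Python rather than anything from the Nutch code base. It assumes that parse-plugins.xml lists the parseable types as `mimeType` elements with a `name` attribute, and that some extension-to-mimeType table is available. The `EXTENSION_TO_MIME` table, the sample XML and all function names are made up for illustration. Extensions we know nothing about, like .foo, are let through so that the type can still be decided once the content has been fetched.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical, trimmed-down stand-in for parse-plugins.xml (assumed layout).
PARSE_PLUGINS_XML = """
<parse-plugins>
  <mimeType name="text/html"><plugin id="parse-html"/></mimeType>
  <mimeType name="application/pdf"><plugin id="parse-pdf"/></mimeType>
</parse-plugins>
"""

# Hypothetical extension -> MIME type table; a real one would come from
# whatever MIME type registry is available.
EXTENSION_TO_MIME = {
    "html": "text/html",
    "htm": "text/html",
    "pdf": "application/pdf",
    "zip": "application/zip",
}


def parseable_mime_types(xml_text):
    """Collect the MIME types that have at least one parser configured."""
    root = ET.fromstring(xml_text)
    return {el.get("name") for el in root.iter("mimeType") if el.get("name")}


def allowed_extensions(mime_types, ext_to_mime):
    """Keep only the extensions whose MIME type some parser can handle."""
    return {ext for ext, mime in ext_to_mime.items() if mime in mime_types}


def url_extension(url):
    """Return the lower-cased extension of the URL path, or '' if none."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1][1:].lower()


def accept(url, allowed, known):
    """Accept URLs with an allowed extension, and URLs whose extension is
    unknown (or missing), since the URL alone cannot settle their type."""
    ext = url_extension(url)
    if ext in allowed:
        return True
    return ext not in known


allowed = allowed_extensions(parseable_mime_types(PARSE_PLUGINS_XML),
                             EXTENSION_TO_MIME)
known = set(EXTENSION_TO_MIME)
```

With these sample tables, a .zip URL would be rejected because no parser claims application/zip, while a .foo URL would pass and be left to content-based detection. The filter itself is only a set lookup per URL, so it should not slow down the fetch/parse noticeably.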

Cheers,
  Chris



______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


