nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy
Date Thu, 08 Sep 2005 15:45:27 GMT

Hi Jerome,

  I may have some time to work on this over the next few days if no one else
does. So, if you're taking the lead on this, I volunteer my help if you'd
like it.

Thanks,
 Chris



On 9/8/05 2:06 AM, "Jerome Charron (JIRA)" <jira@apache.org> wrote:

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
> 
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>      Fix For: 0.8-dev
> 
> 
> The ParserFactory chooses the Parser plugin to use based on the content-types
> and path-suffixes defined in the parsers' plugin.xml files.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType"
> attribute matches the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the end of
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the
> empty string is used.
> 
> This policy causes problems when no match is found, because an essentially
> arbitrary parser is used (and there is a good chance that this parser can't
> handle the content).
> On the other hand, the content-type associated with a parser plugin is
> specified in the plugin.xml of each plugin (this is the value used by the
> ParserFactory), AND each parser's code also checks whether the content-type
> is ok (it uses a hard-coded content-type value rather than the value
> specified in plugin.xml => possibility of mismatches between the hard-coded
> content-type and the content-type declared in plugin.xml).
> 
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
> 
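
For reference, the three-step selection policy quoted above can be sketched
roughly as follows. This is only an illustration of the policy as described in
the issue, not the actual ParserFactory code: the class ParserSelectionSketch,
the ParserEntry type, and the select method are hypothetical names, and
skipping empty attributes in steps 1 and 2 is an assumption made so that step 3
stays meaningful.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the selection policy described in NUTCH-88.
// None of these names come from the Nutch code base.
final class ParserSelectionSketch {

    /** Stand-in for one parser declaration read from a plugin.xml file. */
    static final class ParserEntry {
        final String id;
        final String contentType; // value of the "contentType" attribute
        final String pathSuffix;  // value of the "pathSuffix" attribute

        ParserEntry(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Applies the three steps in order and returns the first match, if any. */
    static Optional<ParserEntry> select(List<ParserEntry> entries,
                                        String contentType, String urlPath) {
        // Step 1: content type has priority. The declared contentType must
        // match the beginning of the content's type.
        // Assumption: entries with an empty contentType are skipped here.
        if (contentType != null) {
            for (ParserEntry e : entries) {
                if (!e.contentType.isEmpty() && contentType.startsWith(e.contentType)) {
                    return Optional.of(e);
                }
            }
        }
        // Step 2: the declared pathSuffix must match the end of the URL path.
        // Assumption: empty suffixes are skipped here and left to step 3.
        if (urlPath != null) {
            for (ParserEntry e : entries) {
                if (!e.pathSuffix.isEmpty() && urlPath.endsWith(e.pathSuffix)) {
                    return Optional.of(e);
                }
            }
        }
        // Step 3: fall back to the first entry whose pathSuffix is the empty
        // string, whether or not that parser can handle the content.
        for (ParserEntry e : entries) {
            if (e.pathSuffix.isEmpty()) {
                return Optional.of(e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<ParserEntry> entries = Arrays.asList(
            new ParserEntry("parse-html", "text/html", "html"),
            new ParserEntry("parse-pdf", "application/pdf", "pdf"),
            new ParserEntry("parse-text", "text/plain", ""));

        // An image matches neither step 1 nor step 2, so step 3 picks the
        // entry with the empty suffix even though it is a text parser.
        System.out.println(select(entries, "image/png", "/logo.png").get().id);
    }
}
```

In this sketch an image/png document at /logo.png ends up with the plain-text
entry, simply because that is the first one declared with an empty pathSuffix.
That is the fallback behaviour the issue complains about.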

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 



