nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
Date Thu, 08 Sep 2005 19:24:35 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12322955 ] 

Andrzej Bialecki  commented on NUTCH-88:
----------------------------------------

Additional issue: the plugin descriptor currently allows only a single MIME type to be
specified. It is a realistic scenario that some plugins can handle multiple content types.

Moreover, plugins could handle content types with varying degrees of "faithfulness" or
precision. For example, there could be three parsers for PDF: one that does only simple text
extraction (jpedal), another that can also handle more complex PDFs with metadata (pdfbox),
and a third that cannot handle metadata but can preserve the layout (pdftohtml). Currently
there is no way to express a preference for one plugin over another when both support the
same content type.
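
To make the two points above concrete, here is a rough sketch of what a descriptor with several MIME types and a preference value could look like. This is only an illustration, not a proposed patch: the class, the numeric "preference" field, the values assigned to it, and the extra "application/x-pdf" type are all hypothetical. Only the three PDF tool names come from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ParserPreferenceSketch {

    /** Hypothetical descriptor: several MIME types plus a preference value. */
    static final class Candidate {
        final String pluginId;
        final Set<String> mimeTypes;
        final int preference; // assumption: higher means "preferred"

        Candidate(String pluginId, int preference, String... mimeTypes) {
            this.pluginId = pluginId;
            this.preference = preference;
            this.mimeTypes = new HashSet<>(Arrays.asList(mimeTypes));
        }
    }

    /** Returns the ids of all plugins declaring the MIME type, most preferred first. */
    static List<String> rank(List<Candidate> candidates, String mimeType) {
        List<Candidate> matching = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.mimeTypes.contains(mimeType)) {
                matching.add(c);
            }
        }
        // List.sort is stable, so equal preferences keep declaration order.
        matching.sort(Comparator.comparingInt((Candidate c) -> c.preference).reversed());
        List<String> ids = new ArrayList<>();
        for (Candidate c : matching) {
            ids.add(c.pluginId);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Preference values are made up for the example.
        List<Candidate> candidates = Arrays.asList(
            new Candidate("jpedal", 1, "application/pdf"),
            new Candidate("pdfbox", 3, "application/pdf", "application/x-pdf"),
            new Candidate("pdftohtml", 2, "application/pdf"));
        System.out.println(rank(candidates, "application/pdf"));
    }
}
```

Returning the whole ranked list rather than a single winner would let a caller fall back to the next parser if the preferred one fails, but that is a design choice this sketch assumes rather than something the issue specifies.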

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>      Fix For: 0.8-dev

>
> The ParserFactory chooses the Parser plugin to use based on the content types and path
> suffixes defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType" attribute matches
> the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the end of the URL's
> path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the empty string
> is used.
> This policy causes problems when no match is found, because an essentially arbitrary
> parser is used (and there is a good chance that this parser can't handle the content).
> On the other hand, the content type associated with a parser plugin is specified in the
> plugin.xml of each plugin (this is the value used by the ParserFactory), AND each parser's
> code also checks whether the content type is acceptable. That check uses a hard-coded
> content-type value rather than the value specified in plugin.xml, so the hard-coded
> content type and the one declared in plugin.xml can get out of sync.
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
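
For readers who want the quoted selection policy spelled out, the three steps can be sketched roughly as below. This is not the actual ParserFactory code: the Descriptor class, the select method, and the attribute values given to the example plugins are assumptions made for illustration. The sketch also assumes that an empty "contentType" or "pathSuffix" never counts as a match in the first two steps, which the description does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class ParserSelectionSketch {

    /** Hypothetical stand-in for the attributes a parser declares in plugin.xml. */
    static final class Descriptor {
        final String id;
        final String contentType;
        final String pathSuffix;

        Descriptor(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    static Optional<Descriptor> select(List<Descriptor> plugins, String contentType, String urlPath) {
        // Step 1: first plugin whose contentType is a prefix of the content's type.
        for (Descriptor p : plugins) {
            if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                return Optional.of(p);
            }
        }
        // Step 2: first plugin whose pathSuffix matches the end of the URL path.
        for (Descriptor p : plugins) {
            if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                return Optional.of(p);
            }
        }
        // Step 3: fall back to the first plugin with an empty pathSuffix.
        for (Descriptor p : plugins) {
            if (p.pathSuffix.isEmpty()) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Attribute values are illustrative, not copied from any real plugin.xml.
        List<Descriptor> plugins = Arrays.asList(
            new Descriptor("parse-text", "text/plain", ""),
            new Descriptor("parse-html", "text/html", "html"),
            new Descriptor("parse-pdf", "application/pdf", "pdf"));
        // Nothing matches a PNG, so the fallback picks the text parser.
        System.out.println(select(plugins, "image/png", "/img/logo.png").get().id); // prints "parse-text"
    }
}
```

The last call shows the problem described in the issue: content that no plugin declares still ends up with whichever parser happens to have an empty path suffix.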

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

