nutch-dev mailing list archives

From "Ferdy (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
Date Fri, 02 Sep 2011 14:00:11 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989 ]

Ferdy edited comment on NUTCH-1097 at 9/2/11 1:59 PM:
------------------------------------------------------

After digging into it for a while, I believe the best solution for now is to allow regexes
in the contentType attribute of plugin.xml. This way, multiple mimetypes mapped in parse-plugins.xml
can be supported by the plugin.xml of the individual parser extensions (instead of simply
using the asterisk wildcard).

To keep backwards compatibility, I decided to escape '+' in the contentType attribute of
extensions, because a lot of mimetypes contain this character. This will not break existing
functionality. So you can use any regular expression supported by the standard Java Pattern,
except that '+' is always treated as a literal character and never as a quantifier. The asterisk
wildcard is still usable, because it is checked first in ParserFactory. (Otherwise an exception
would occur, because a bare asterisk is not a valid regex.)
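
To illustrate the two pitfalls described above, here is a minimal sketch using only
java.util.regex. The class name PlusEscapeDemo and the helpers escapePlus and isValidRegex
are hypothetical names chosen for this example; they are not taken from the patch.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustration only; not the code from the NUTCH-1097 patch.
final class PlusEscapeDemo {

    // Hypothetical helper: turn every '+' into a literal by prefixing a backslash.
    static String escapePlus(String contentType) {
        return contentType.replace("+", "\\+");
    }

    // Returns true if the string compiles as a java.util.regex pattern.
    static boolean isValidRegex(String s) {
        try {
            Pattern.compile(s);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String mime = "application/xhtml+xml";

        // Unescaped, "l+" means "one or more 'l'", so the literal '+' in the
        // mimetype is never matched: prints false.
        System.out.println(Pattern.matches(mime, mime));

        // With '+' escaped, the mimetype matches itself: prints true.
        System.out.println(Pattern.matches(escapePlus(mime), mime));

        // A bare "*" is a dangling quantifier and does not compile: prints false.
        System.out.println(isValidRegex("*"));
    }
}
```

This is why the wildcard has to be handled before any regex is compiled, and why a plain
mimetype containing '+' would stop matching if it were used as a regex without escaping.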

To summarize, the latest patch (v3) contains two changes:
- ParserFactory matches contentType attribute of extensions using standard Java regexes with
escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so it's consistent
with the default provided parse-plugins.xml.
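
The matching behaviour summarized above could look roughly like the following sketch. It
is an assumption about the shape of the logic, not the actual ParserFactory code; the class
ContentTypeMatcher and its matches method are hypothetical names.

```java
import java.util.regex.Pattern;

// Sketch of the described matching rule; not the actual ParserFactory code.
final class ContentTypeMatcher {

    // contentTypeAttr: value of the contentType attribute in a plugin.xml.
    // mimeType: the mimetype of the content to be parsed.
    static boolean matches(String contentTypeAttr, String mimeType) {
        // The bare wildcard is checked first, because "*" alone is not a valid regex.
        if ("*".equals(contentTypeAttr)) {
            return true;
        }
        // Escape '+' so mimetypes such as application/xhtml+xml match literally.
        String regex = contentTypeAttr.replace("+", "\\+");
        // Pattern.matches requires the whole mimetype to match the regex.
        return Pattern.matches(regex, mimeType);
    }

    public static void main(String[] args) {
        String attr = "text/html|application/xhtml+xml";
        System.out.println(matches(attr, "text/html"));             // true
        System.out.println(matches(attr, "application/xhtml+xml")); // true
        System.out.println(matches(attr, "application/pdf"));       // false
        System.out.println(matches("*", "application/pdf"));        // true
    }
}
```

With this rule, an existing plugin.xml that lists a single plain mimetype such as text/html
keeps matching exactly as before.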

I'm not arguing that these changes should be committed to the codebase as is, but I do believe
the current situation is not flexible enough, especially since many-to-one mappings in
parse-plugins.xml cannot be supported by parser plugin.xml files. So if you have any suggestions
or corrections, feel free to reply.

> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml,
> however the plugin.xml of this plugin does not list this type. Either change the entry in
> parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
