tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-458) Specify HTMLHandler via Context
Date Wed, 07 Jul 2010 22:11:52 GMT

    [ https://issues.apache.org/jira/browse/TIKA-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886110#action_12886110 ]

Jukka Zitting commented on TIKA-458:
------------------------------------

The reason why I originally didn't do this was to avoid making it a backwards-compatibility
requirement that the HTML parser uses a SAX content handler to internally process HTML documents.
This assumption may no longer hold if we decide to use libraries like boilerpipe (see TIKA-420)
as the default HTML parsing mechanism.

That said, I guess in this case the benefits probably outweigh the possible drawbacks of
increased backwards-compatibility requirements on the HTML parser design.

About the patch itself: the proposed way of using HTMLHandler is a bit troublesome, as the
only way for a custom HTMLHandler to access the output ContentHandler, the Metadata instance
and the parse context is for the client application to pass them in to the custom HTMLHandler
instance. This won't work correctly, for example, with composite documents like Zip archives,
where each embedded document is parsed with its own handler and metadata. A better solution
might be to introduce a factory interface like this:

    public interface HTMLHandlerFactory {
        ContentHandler createHTMLHandler(
            ContentHandler handler, Metadata metadata, ParseContext context);
    }
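
To illustrate why a factory helps with composite documents, here is a minimal, self-contained
sketch of how the parser could look up such a factory from the context and create a fresh
handler for every document it processes. The Metadata and ParseContext classes below are
simplified stand-ins for the real Tika classes, and LinkCountingHandler, parseDocument and the
"links" metadata key are hypothetical names made up for this example; none of this is taken
from the patch or from the HTML parser itself.

```java
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class HandlerFactorySketch {

    /** Simplified stand-in for Tika's Metadata class (assumption). */
    public static class Metadata {
        private final Map<String, String> values = new HashMap<>();
        public void set(String name, String value) { values.put(name, value); }
        public String get(String name) { return values.get(name); }
    }

    /** Simplified stand-in for Tika's ParseContext class (assumption). */
    public static class ParseContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        public <T> void set(Class<T> key, T value) { entries.put(key, value); }
        public <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    /** The factory interface proposed in the comment. */
    public interface HTMLHandlerFactory {
        ContentHandler createHTMLHandler(
            ContentHandler handler, Metadata metadata, ParseContext context);
    }

    /** Hypothetical custom handler: counts <a> elements, forwards all events. */
    public static class LinkCountingHandler extends XMLFilterImpl {
        private final Metadata metadata;
        private int links = 0;

        public LinkCountingHandler(ContentHandler output, Metadata metadata) {
            setContentHandler(output);
            this.metadata = metadata;
        }

        @Override
        public void startElement(
                String uri, String local, String name, Attributes atts)
                throws SAXException {
            if ("a".equals(local)) {
                metadata.set("links", Integer.toString(++links));
            }
            super.startElement(uri, local, name, atts);
        }
    }

    /** Simulates what an HTML parser might do for a single document. */
    public static void parseDocument(
            String[] elements, ContentHandler output,
            Metadata metadata, ParseContext context) throws SAXException {
        // Ask the context for a factory; fall back to the plain output handler.
        HTMLHandlerFactory factory = context.get(HTMLHandlerFactory.class);
        ContentHandler handler = (factory != null)
            ? factory.createHTMLHandler(output, metadata, context)
            : output;
        handler.startDocument();
        for (String element : elements) {
            handler.startElement("", element, element, new AttributesImpl());
            handler.endElement("", element, element);
        }
        handler.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        ParseContext context = new ParseContext();
        context.set(HTMLHandlerFactory.class,
            (handler, metadata, ctx) -> new LinkCountingHandler(handler, metadata));

        // Two documents, e.g. two entries of a Zip archive, each with its
        // own Metadata instance but sharing the same context.
        Metadata first = new Metadata();
        Metadata second = new Metadata();
        parseDocument(new String[] {"p", "a", "a"}, new DefaultHandler(), first, context);
        parseDocument(new String[] {"a"}, new DefaultHandler(), second, context);
        System.out.println(first.get("links") + " " + second.get("links"));
    }
}
```

Because the client registers a factory rather than a handler instance, each document gets a
handler bound to its own output handler and Metadata, so the link counts of the two documents
above do not get mixed up.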

PS. The patch seems to contain a few unrelated changes to the HTML parser. Can you file
separate issues for those changes?

PPS. It would be better if we used only spaces for indentation.


> Specify HTMLHandler via Context
> -------------------------------
>
>                 Key: TIKA-458
>                 URL: https://issues.apache.org/jira/browse/TIKA-458
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>         Attachments: TIKA-458.patch
>
>
> One of the recent changes on Tika is the possibility to specify a custom HTMLMapper via
> the Context - which I think is an elegant mechanism. I was wondering whether there would be
> a reason NOT to be able to do the same for the HTMLHandler and if nothing is passed via the
> Context, rely on the current implementation. This would give more control to the user on what
> to do with the SAX events while at the same time preserving the functionality by default.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

