tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-304) HtmlParser could be easier to subclass
Date Fri, 09 Oct 2009 16:33:31 GMT

    [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764087#action_12764087 ]

Ken Krugler commented on TIKA-304:

A few comments on this:

1. I think it's an improvement, not a bug :)

2. I agree that it would be great to be able to alter the behavior of HtmlParser. Making subclassing
easier is one approach; another might be an IoC-style option that lets the caller specify a different
content handler.

3. Preserving attributes is very important - I had a todo on my list to file an issue about
this. E.g. with links, there can be attributes like the target content language that you want
to preserve.

4. I have some mods for HtmlParser that I need to turn into issues/patches, e.g. link extraction
from <img>, <link>, etc. tags. But I'd hate to put Jukka into n-way merge hell.
So I might wait for this patch to get rolled in first.
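
To make points 2 through 4 concrete, here is a rough sketch of the content-handler idea. It is not
Tika's HtmlParser API and not the attached patch; it uses only the standard SAX interfaces, and the
class name LinkCollectingHandler and its getLinks() method are hypothetical. It assumes the parser
would deliver its SAX events to whatever handler the caller supplies. The handler passes every
attribute through untouched and records link targets from <a>, <link> and <img> on the way.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator: wraps the caller's handler instead of subclassing the parser.
public class LinkCollectingHandler extends DefaultHandler {

    private final ContentHandler downstream;
    private final List<String> links = new ArrayList<>();

    public LinkCollectingHandler(ContentHandler downstream) {
        this.downstream = downstream;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Fall back to qName when namespace processing is off and localName is empty.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;

        // Pick the attribute that carries the link target for this element, if any.
        String linkAttr = null;
        if ("a".equals(name) || "link".equals(name)) {
            linkAttr = "href";
        } else if ("img".equals(name)) {
            linkAttr = "src";
        }
        if (linkAttr != null) {
            String target = atts.getValue(linkAttr);
            if (target != null) {
                links.add(target);
            }
        }

        // Forward the element with all of its attributes (id, hreflang, ...) preserved.
        downstream.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        downstream.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        downstream.characters(ch, start, length);
    }

    // Link targets in the order they were seen.
    public List<String> getLinks() {
        return links;
    }
}
```

The appeal of this shape is that several such handlers can be stacked without touching the parser
itself, which should also keep the merge pain down.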

> HtmlParser could be easier to subclass
> --------------------------------------
>                 Key: TIKA-304
>                 URL: https://issues.apache.org/jira/browse/TIKA-304
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4, 0.5
>            Reporter: Benson Margulies
>         Attachments: html-parser-subclass.diff
> It would be nice if one could subclass HtmlParser to change what it passes along, instead
> of having to copy it. I'll attach a first effort.
> It would also be good if attributes could be preserved (particularly id attributes) but
> let's see how you like my first patch.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
