tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-304) HtmlParser could be easier to subclass
Date Fri, 09 Oct 2009 16:33:31 GMT

    [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764087#action_12764087 ]

Ken Krugler commented on TIKA-304:
----------------------------------

A few comments on this:

1. I think it's an improvement, not a bug :)

2. I agree that it would be great to be able to alter the behavior of HtmlParser. Making subclassing
easier is one approach; another might be an IoC-style option where the caller specifies a different
content handler.

3. Preserving attributes is very important - I had a todo on my list to file an issue about
this. For example, links can carry attributes, such as the target content language, that you
want to preserve.

4. I have some mods for HtmlParser that I need to turn into issues/patches, e.g. link extraction
from <img>, <link>, and similar tags. But I'd hate to put Jukka into n-way merge hell,
so I might wait for this patch to get rolled in first.
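
To make points 2 and 3 concrete, here is a minimal sketch of the IoC idea, using only the
standard org.xml.sax classes. It is not Tika's HtmlParser API: ToyParser, LinkCollector and
collectLinks are hypothetical names made up for illustration, and the sample link is invented.
The point is only that when the caller supplies the content handler, it can keep attributes
(here href and hreflang) without subclassing or copying the parser.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerInjectionSketch {

    /** Hypothetical stand-in for a parser; it emits one hard-coded link as SAX events. */
    static class ToyParser {
        void parse(ContentHandler handler) throws SAXException {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "href", "href", "CDATA", "http://example.com/");
            atts.addAttribute("", "hreflang", "hreflang", "CDATA", "fr");
            handler.startDocument();
            handler.startElement("", "a", "a", atts);
            handler.endElement("", "a", "a");
            handler.endDocument();
        }
    }

    /** Caller-supplied handler that records link attributes instead of dropping them. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equals(localName)) {
                links.add(atts.getValue("href") + " [" + atts.getValue("hreflang") + "]");
            }
        }
    }

    /** Wires the caller's handler into the parser and returns what it collected. */
    static List<String> collectLinks() {
        LinkCollector collector = new LinkCollector();
        try {
            new ToyParser().parse(collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.links;
    }

    public static void main(String[] args) {
        System.out.println(collectLinks());
    }
}
```

The subclassing route from the patch would instead put the behavior change inside the parser;
the two approaches are not mutually exclusive.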

> HtmlParser could be easier to subclass
> --------------------------------------
>
>                 Key: TIKA-304
>                 URL: https://issues.apache.org/jira/browse/TIKA-304
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4, 0.5
>            Reporter: Benson Margulies
>         Attachments: html-parser-subclass.diff
>
>
> It would be nice if one could subclass HtmlParser to change what it passes along, instead
> of having to copy it. I'll attach a first effort.
> It would also be good if attributes could be preserved (particularly id attributes) but
> let's see how you like my first patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

