tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (TIKA-310) Use TagSoup to parse HTML
Date Wed, 14 Oct 2009 19:50:31 GMT

     [ https://issues.apache.org/jira/browse/TIKA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-310.

       Resolution: Fixed
    Fix Version/s: 0.5

Replaced NekoHTML with TagSoup in revision 825239.

> Use TagSoup to parse HTML
> -------------------------
>                 Key: TIKA-310
>                 URL: https://issues.apache.org/jira/browse/TIKA-310
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.5
> The NekoHTML library we currently use for parsing HTML has a transitive dependency on
> Apache Xerces. The Xerces library is pretty big (1.2MB) and is known to cause various problems
> when included in the classpath of an application or a container that expects some other XML
> parser library.
> The TagSoup library (http://home.ccil.org/~cowan/XML/tagsoup/) provides an alternative
> HTML parsing library that works pretty much like NekoHTML but doesn't depend on Xerces. I
> suggest we switch from NekoHTML to TagSoup unless this change causes major regressions in
> HTML parsing.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
