tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-310) Use TagSoup to parse HTML
Date Wed, 14 Oct 2009 19:40:31 GMT
Use TagSoup to parse HTML

                 Key: TIKA-310
                 URL: https://issues.apache.org/jira/browse/TIKA-310
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting

The NekoHTML library we currently use for parsing HTML has a transitive dependency on Apache
Xerces. The Xerces library is pretty big (1.2 MB) and is known to cause various problems when
included in the classpath of an application or a container that expects some other XML parser.
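
As a rough illustration of why an extra XML parser on the classpath matters: JAXP resolves its
factories at runtime, so the implementation an application gets can depend on which jars are
present. The sketch below uses only standard JDK classes; the class name JaxpLookup and its
method names are made up for this example and are not part of Tika.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic helper, not part of Tika: reports which JAXP
// implementations the runtime lookup selects in the current classpath.
public class JaxpLookup {

    // Class name of the SAX parser factory that JAXP resolved.
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Class name of the DOM builder factory that JAXP resolved.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The printed names may change when a Xerces jar is added to the
        // classpath, because JAXP can discover factories from jars.
        System.out.println("SAX factory: " + saxFactoryName());
        System.out.println("DOM factory: " + domFactoryName());
    }
}
```

Running this once with and once without the Xerces jar on the classpath is one way to see
whether the dependency changes the parser that the rest of the application receives.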

The TagSoup library (http://home.ccil.org/~cowan/XML/tagsoup/) is an alternative HTML
parser that works pretty much like NekoHTML but doesn't depend on Xerces. I suggest
we switch from NekoHTML to TagSoup unless this change causes major regressions in HTML parsing.
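
To sketch why such a switch can be fairly contained: TagSoup exposes a SAX XMLReader, so code
written against the SAX ContentHandler interface does not need to know which parser feeds it.
The example below is a minimal sketch, not Tika's actual HtmlParser. LinkExtractor and extract
are hypothetical names, and main uses the JDK's own SAX parser on well-formed markup so that the
example runs without extra jars. With TagSoup on the classpath, the reader would instead be
created as new org.ccil.cowan.tagsoup.Parser(), which is not exercised here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of Tika: collects href values from <a>
// elements using only SAX callbacks, independent of the parser behind them.
public class LinkExtractor extends DefaultHandler {

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Parsers that are not namespace-aware report an empty local name,
        // so fall back to the qualified name in that case.
        String name = localName.isEmpty() ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Parses the markup with whatever XMLReader is supplied and returns the links found.
    static List<String> extract(XMLReader reader, String markup) throws Exception {
        LinkExtractor handler = new LinkExtractor();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; swapping the parser means changing only this line.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String markup = "<html><body><a href=\"http://tika.apache.org/\">Tika</a></body></html>";
        System.out.println(extract(reader, markup));
    }
}
```

The handler stays the same whichever reader is passed in, so regressions would most likely show
up as differences in the SAX events each parser emits for malformed HTML.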

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
