tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject FYI: NekoHTML/Xerces dependency replaced with TagSoup
Date Wed, 14 Oct 2009 19:57:35 GMT

As noted in TIKA-310, I've replaced the NekoHTML dependency (and the
transitive Xerces one) with the TagSoup library.

Based on quick testing, TagSoup works at least as well as NekoHTML for
our needs, and the dependency change helped cut the
tika-app jar size from 27MB to 25MB. Most notably this change removes
the Xerces dependency that is troublesome for many environments that
depend on some specific XML parser being picked up by JAXP.
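
To see which parser JAXP currently picks up in a given environment, a
minimal sketch along these lines may help. It uses only the standard
javax.xml.parsers API; the class name JaxpCheck and its helper methods
are made up for illustration and are not part of Tika. Run it once with
and once without Xerces on the classpath to compare the results.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of Tika: reports which JAXP factory
// implementations are selected on the current classpath.
public class JaxpCheck {

    // Class name of the SAX parser factory chosen by the JAXP lookup.
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Class name of the DOM builder factory chosen by the JAXP lookup.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX factory: " + saxFactoryName());
        System.out.println("DOM factory: " + domFactoryName());
    }
}
```

The printed class names depend on the JDK and on whatever XML parser
jars are on the classpath, so no particular output is assumed here.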

However, since this is a pretty notable change to a core feature,
please try out the latest trunk and report any problems if you use
Tika for parsing lots of HTML.


Jukka Zitting
