nutch-dev mailing list archives

From Doug Cook
Subject Anyone looked for a better HTML parser?
Date Mon, 15 Oct 2007 20:44:53 GMT

I've spent quite a bit of time working with both Neko and Tagsoup, and they
both have some fairly serious bugs:

Neko occasionally hangs, and it doesn't cope well with a fair
amount of "bad" HTML that displays just fine in a browser.

Tagsoup is better in terms of handling "bad" HTML, but it has a pretty
serious bug in that HTML character entities are expanded in inappropriate
places, e.g. inside of hrefs, so that a dynamic URL of the form has problems: the &sub is interpreted as an
HTML character entity, and an invalid href is created.  John Cowan, the
author of Tagsoup, more or less said "yeah, I know, everybody mentions that,
but that's done at such a low level in the code it's not likely to get fixed
any time soon". (See a discussion of this and other issues at 
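
To make the failure mode concrete, here is a minimal sketch of how a lenient entity expander can corrupt an href. This is not TagSoup's actual code: the class name, the tiny entity table, and the example.com URL are all hypothetical, and the only assumption is that the decoder accepts an entity name without its trailing semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LenientEntityDemo {
    // Tiny illustrative entity table (an assumption); real parsers carry many more entries.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("sub", "\u2282"); // SUBSET OF
    }

    /**
     * Expands known entities in text. When requireSemicolon is false,
     * a bare "&sub" is treated like "&sub;", which is the lenient
     * behaviour that damages query strings.
     */
    public static String decode(String text, boolean requireSemicolon) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                // Find the longest entity name that starts right after the '&'.
                String match = null;
                for (String name : ENTITIES.keySet()) {
                    if (text.startsWith(name, i + 1)
                            && (match == null || name.length() > match.length())) {
                        match = name;
                    }
                }
                if (match != null) {
                    int end = i + 1 + match.length();
                    boolean hasSemicolon = end < text.length() && text.charAt(end) == ';';
                    if (hasSemicolon || !requireSemicolon) {
                        out.append(ENTITIES.get(match));
                        i = hasSemicolon ? end + 1 : end;
                        continue;
                    }
                }
            }
            // Not an entity we expand: copy the character through unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String href = "http://example.com/view?id=7&sub=2"; // hypothetical dynamic URL
        System.out.println("lenient: " + decode(href, false)); // "&sub" is replaced
        System.out.println("strict:  " + decode(href, true));  // URL is left intact
    }
}
```

With the lenient setting the "&sub" in the query string is swallowed and replaced by a single character, so the resulting link no longer carries the sub parameter; requiring the semicolon leaves the URL alone while still expanding a properly written "&amp;".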

The Tagsoup bug affects some 3-4% of the sites in my index, so I consider it
fatal, and I *know* Neko misses some text, sometimes entire documents,
because it can't deal with pathological HTML.

Has anyone (a) got local fixes for any of these problems, or (b) found a
superior Java HTML parser out there?
