nutch-dev mailing list archives

From Doug Cook <nab...@candiru.com>
Subject Anyone looked for a better HTML parser?
Date Mon, 15 Oct 2007 20:44:53 GMT


I've spent quite a bit of time working with both Neko and Tagsoup, and they
both have some fairly serious bugs:

Neko hangs occasionally, and it doesn't deal well with a fair amount of "bad"
HTML that displays just fine in a browser.

Tagsoup is better at handling "bad" HTML, but it has a pretty serious bug:
HTML character entities are expanded in inappropriate places, e.g. inside
hrefs. A dynamic URL of the form http://www.foo.com/bar?x=1&sub=5 breaks
because the &sub is interpreted as an HTML character entity, and an invalid
href is created.  John Cowan, the
author of Tagsoup, more or less said "yeah, I know, everybody mentions that,
but that's done at such a low level in the code it's not likely to get fixed
any time soon". (See a discussion of this and other issues at
http://tech.groups.yahoo.com/group/tagsoup-friends/message/838). 
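For concreteness, here's a minimal sketch of the behavior one would want
instead (this is not TagSoup's code; the class name and the tiny entity table
are mine, purely for illustration): an attribute-value decoder that only
expands an entity when its name is terminated by a semicolon, so an
unterminated "&sub" in a query string passes through unchanged.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: decode character entities in an attribute value, but
// require a terminating semicolon, so "&sub=5" in a URL is left alone.
public class StrictEntityDecoder {
    // Tiny illustrative entity table; a real decoder would use the full set.
    private static final Map<String, String> ENTITIES =
            Map.of("amp", "&", "lt", "<", "gt", ">", "sub", "\u2282");

    private static final Pattern ENTITY =
            Pattern.compile("&([a-zA-Z][a-zA-Z0-9]*);");

    public static String decode(String value) {
        Matcher m = ENTITY.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only replace names we know; leave anything else untouched.
            String repl = ENTITIES.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // "&sub" has no semicolon, so the href survives intact.
        System.out.println(decode("http://www.foo.com/bar?x=1&sub=5"));
        // "&sub;" is a properly terminated entity and is expanded.
        System.out.println(decode("a &sub; b"));
    }
}
```

The fix TagSoup would need is essentially this semicolon check applied inside
attribute values, which is what the thread above says sits too deep in the
code to change easily.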

The Tagsoup bug affects some 3-4% of the sites in my index, so I consider it
fatal, and I *know* Neko misses some text, sometimes entire documents,
because it can't deal with pathological HTML.

Has anyone (a) got local fixes for any of these problems, or (b) found a
superior Java HTML parser out there?

Doug
-- 
View this message in context: http://www.nabble.com/Anyone-looked-for-a-better-HTML-parser--tf4630266.html#a13221500
Sent from the Nutch - Dev mailing list archive at Nabble.com.

