nutch-dev mailing list archives

From Sami Siren <ssi...@gmail.com>
Subject Re: Anyone looked for a better HTML parser?
Date Tue, 16 Oct 2007 13:50:41 GMT
Doug Cook wrote:
> The tagsoup bug affects some 3-4% of the sites in my index, so I consider it
> fatal, and I *know* Neko misses some text, sometimes entire documents,
> because it can't deal with pathological HTML.

Do you have URLs of such bad content available to look at?

-- 
 Sami Siren
