nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Neko parsing fix inadvertently reverted?
Date Thu, 17 Aug 2006 20:24:17 GMT
Sami Siren wrote:
> Benjamin Higgins wrote:
>> Comments?
>
> I cannot comment on the issue itself, but if you can submit a patch 
> (perhaps with testcase that demonstrates this) then it will be easier 
> to  act on.

Benjamin,

Could you please send me a copy of the offending HTML for testing (off 
the list)?

A little background: I knew of this issue when I changed the API to use 
DocumentFragment. However, as far as I was able to test it with the most 
recent version of Neko at that time, it didn't exhibit this problem.

The main motivation for the change was to enable better parsing of broken 
documents with multiple <html> tags (or no <html> at all, but <head> and 
<body> as "root" elements). While this is not possible using a Document, 
it is possible using a DocumentFragment, which doesn't necessarily have 
to represent a well-formed XML tree; specifically, it doesn't require a 
single root node. Please see the Javadoc of org.w3c.dom.DocumentFragment 
for a longer explanation.
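
To illustrate the single-root constraint, here is a minimal sketch that 
uses only the DOM classes shipped with the JDK, not NekoHTML or the Nutch 
parser itself. The class and method names (FragmentRoots, 
documentAcceptsSecondRoot, fragmentTopLevelCount) are hypothetical and 
exist only for this example. It assumes the JDK's default DOM 
implementation, which enforces the one-root rule on append.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;

public class FragmentRoots {

    // Creates an empty DOM Document using the JDK's default implementation.
    public static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Tries to append two <html> elements directly under a Document.
    // Returns true only if the second append is accepted.
    public static boolean documentAcceptsSecondRoot() {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("html"));
        try {
            doc.appendChild(doc.createElement("html"));
            return true;
        } catch (DOMException e) {
            // A Document may have at most one element child.
            return false;
        }
    }

    // Appends <head> and <body> as siblings directly under a
    // DocumentFragment and returns the number of top-level nodes it holds.
    public static int fragmentTopLevelCount() {
        Document doc = newDocument();
        DocumentFragment frag = doc.createDocumentFragment();
        frag.appendChild(doc.createElement("head"));
        frag.appendChild(doc.createElement("body"));
        return frag.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        System.out.println("Document accepts second root: "
                + documentAcceptsSecondRoot());
        System.out.println("Fragment top-level nodes: "
                + fragmentTopLevelCount());
    }
}
```

The fragment keeps both "pseudo-root" elements as siblings, whereas the 
Document rejects a second element child.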

So, if we change it back to Document we will lose this functionality, 
and some pages will be severely truncated, because in such cases 
NekoHTML takes only the first "pseudo-root" node and discards all 
others. However, if you are dealing mostly with well-formed documents 
you may not need this ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
