nutch-dev mailing list archives

From "Benjamin Higgins" <bhigg...@gmail.com>
Subject Neko parsing fix inadvertently reverted?
Date Fri, 11 Aug 2006 17:51:41 GMT
I was taking a look at HtmlParser.java, and I think the fix to NUTCH-17 was
accidentally removed.  See:

http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.8/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=log

Specifically, in revision 160319, among other things, DOMFragmentParser was
changed to DOMParser.  The commit comment for that revision explains why:

Changed to use NekoHTML's DOMParser instead of its DOMFragmentParser.
For some reason, the DOMFragmentParser can be very slow with large
documents while the DOMParser has no problems with these.  Also added
a main() that permits easier debugging.

However, in revision 179436, a big patch that included TagSoup among other
things, the change to DOMParser seems to have been lost.

I bring this up because I am having exactly the problem described in
NUTCH-17.  I am using Neko 0.9.4.  It occurs on some particularly long
documents: the fetcher simply hangs, and if I wait a few hours it resumes.
The HTML is nothing special; in fact, it's just a bunch of text
(HTML-escaped, i.e. with the <, > and & characters converted to entities)
inside a <pre> tag.
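
For anyone who wants to try reproducing this, here is a minimal sketch that generates a document of the shape described above: a large amount of HTML-escaped text inside a single <pre> tag. The class name LongPreDocument, the default line count, and the sample line text are all assumptions made up for illustration; they are not taken from the pages that actually hung, and whether a file generated this way triggers the same slowdown is untested.

```java
public class LongPreDocument {

    /** Escapes the three characters mentioned in the report: &, < and >. */
    public static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds an HTML page whose body is one <pre> block of escaped text. */
    public static String build(int lines) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            // Hypothetical filler text; it only needs to contain <, > and &.
            body.append("line ").append(i)
                .append(": if (a < b && b > c) { ... }\n");
        }
        return "<html><body><pre>\n"
                + escapeHtml(body.toString())
                + "</pre></body></html>\n";
    }

    public static void main(String[] args) {
        // The line count is an arbitrary assumption; raise it for a larger page.
        int lines = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        System.out.print(build(lines));
    }
}
```

Redirecting the output to a file (for example `java LongPreDocument 500000 > long-pre.html`) gives a test page that can be fed to the parser under either configuration.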

Comments?

Ben
