nutch-dev mailing list archives

From "Siddhartha Reddy" <s...@grok.in>
Subject java.lang.StackOverflowError in HTMLMetaProcessor.getMetaTagsHelper
Date Fri, 13 Jun 2008 05:34:58 GMT
While parsing a certain page in Nutch, I am getting a
java.lang.StackOverflowError caused by the recursion in
HTMLMetaProcessor.getMetaTagsHelper.

A copy of the offending page is available at
http://www.grok.in/tmp/f005.html. The HTML source of that page makes it
clear why the StackOverflowError occurs.
HTMLMetaProcessor.getMetaTagsHelper uses recursion to walk the HTML tree,
stopping when it encounters a "body" tag. But this page does not have a
body tag at all! Moreover, the page does not close most of the HTML tags
that it opens, which creates a very deep tree.

Such pages, though uncommon, are plentiful on the Web. Ideally, Nutch
should not choke like this on encountering them. One option is to use
something like java.util.LinkedList as a queue to traverse the tree without
using recursion. This is how I am currently avoiding the problem. If this
approach is acceptable, I can open a Jira issue and submit a patch.
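
A minimal sketch of such a queue-based traversal follows. It is not the
Nutch code or the proposed patch: the class IterativeMetaWalk and the
methods collectMetaTags and deepDocument are hypothetical names used only
for illustration, and the sketch merely collects "meta" elements from a
standard org.w3c.dom tree instead of filling in Nutch's own data
structures.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; not part of Nutch.
class IterativeMetaWalk {

    /** Collects "meta" elements breadth-first, never descending into "body". */
    static List<Element> collectMetaTags(Node root) {
        List<Element> metas = new ArrayList<>();
        Queue<Node> queue = new LinkedList<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                String name = node.getNodeName();
                if ("body".equalsIgnoreCase(name)) {
                    continue; // same stop condition as the recursive walk
                }
                if ("meta".equalsIgnoreCase(name)) {
                    metas.add((Element) node);
                }
            }
            // Enqueue the children instead of recursing into them, so the
            // call stack stays flat no matter how deep the tree is.
            for (Node child = node.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                queue.add(child);
            }
        }
        return metas;
    }

    /** Builds html > font > font > ... > meta with the given nesting depth. */
    static Document deepDocument(int depth) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        // Build from the innermost element outwards.
        Node current = doc.createElement("meta");
        for (int i = 0; i < depth; i++) {
            Element parent = doc.createElement("font");
            parent.appendChild(current);
            current = parent;
        }
        Element html = doc.createElement("html");
        html.appendChild(current);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // No "body" element and a very deep chain of unclosed-style nesting.
        Document doc = deepDocument(50000);
        System.out.println(collectMetaTags(doc).size()); // prints 1
    }
}
```

The queue grows on the heap rather than on the call stack, so the depth of
the tree no longer matters. java.util.ArrayDeque would serve equally well
as the queue here.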

Best,
Siddhartha
