While parsing a certain page in Nutch, I get a java.lang.StackOverflowError
caused by the recursion in HTMLMetaProcessor.getMetaTagsHelper.
A copy of the offending page is available at http://www.grok.in/tmp/f005.html
The HTML source of that page makes the cause clear. HTMLMetaProcessor.getMetaTagsHelper walks the HTML tree recursively and stops only when it encounters a "body" tag, but this page has no body tag at all. The page also leaves most of the tags it opens unclosed, so the parser produces a very deep tree and the recursion exhausts the stack.
Such pages, though uncommon as a fraction of the Web, still exist in large numbers, and ideally Nutch should not choke on them. One option is to traverse the tree without recursion, using something like java.util.LinkedList as a queue of nodes still to be visited. This is how I am currently avoiding the problem. If this approach is acceptable, I can open a Jira issue and submit a patch.
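
To illustrate the idea, here is a minimal sketch of a queue-based traversal. It is not the actual Nutch code or the proposed patch: the class IterativeMetaWalker and the methods findMetaTags and buildDeepDocument are hypothetical names made up for this example, it uses the JDK's org.w3c.dom API directly, and it uses java.util.ArrayDeque as the queue (a java.util.LinkedList would work the same way). It assumes the goal is to collect "meta" elements and to skip anything under "body".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper for illustration only; not part of Nutch.
class IterativeMetaWalker {

    /** Collects "meta" elements using an explicit queue instead of recursion. */
    static List<Element> findMetaTags(Node root) {
        List<Element> metas = new ArrayList<>();
        Deque<Node> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            Node node = pending.remove();
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                String name = node.getNodeName();
                if ("body".equalsIgnoreCase(name)) {
                    continue; // assumption: meta tags of interest are never under body
                }
                if ("meta".equalsIgnoreCase(name)) {
                    metas.add((Element) node);
                }
            }
            // Enqueue children; the pending work lives on the heap, not the call stack.
            for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
                pending.add(child);
            }
        }
        return metas;
    }

    /** Builds a test document: `depth` nested elements, no body, one meta at the bottom. */
    static Document buildDeepDocument(int depth) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element current = doc.createElement("meta");
        current.setAttribute("name", "robots");
        // Wrap from the inside out so each appendChild targets a detached parent.
        for (int i = 0; i < depth; i++) {
            Element parent = doc.createElement("font");
            parent.appendChild(current);
            current = parent;
        }
        Element html = doc.createElement("html");
        html.appendChild(current);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = buildDeepDocument(100_000);
        // Prints the number of meta elements found in the deeply nested tree.
        System.out.println(findMetaTags(doc).size());
    }
}
```

Because the nodes waiting to be visited are held in a heap-allocated queue, the depth of the tree affects only memory use and not the call stack. The queue gives breadth-first order; if document order matters, the same loop can use the deque as a stack and push children in reverse.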