tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements
Date Thu, 12 Aug 2010 02:19:51 GMT
Hi all,

Digging deeper, the current behavior seems to be causing problems that  
were not evident in Tika 0.7. We noticed this when switching the Bixo  
code to use Tika 0.8-SNAPSHOT.

For example, if you have a document that looks like:

		<meta http-equiv="content-type" content="text/html; charset=utf-8">
		<title>Some Title</title>

The lazyStartDocument() method is called when the <meta> tag is  
encountered by HtmlHandler, because it calls xhtml.startElement() with  
the meta tag.

Since this is before <title> has been seen, the output generated has  
an empty <title> element. And that causes a bunch of problems for our  

I believe this (and the previous problem I'd reported) is a side- 
effect of TIKA-379, which Chris M. rolled in during change 949635.

Unfortunately I think lazyStartDocument() needs to be re-thought. A  
rough proposal would be:

1. HtmlHandler should call xhtml start/endElement for all elements,  
rather than relying on a fragile implicit dependency between its  
behavior and that of XHTMLContentHandler.

2. In XHTMLContentHandler, the elements received should be queued up  
until endElement() is called for <head>, or startElement() is called  
for <body>, or endDocument() is called.
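
To make point 2 concrete, here is a rough sketch of the queueing idea using only the JDK's SAX classes. The class name HeadBufferingHandler and its helper names are made up for illustration and are not part of Tika. It assumes local names are populated on every event, and it replays the queued events in arrival order, since the proposal doesn't yet say what should happen at flush time (for example, filling in the title from metadata).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch (not Tika code): queues SAX events until the head is complete. */
class HeadBufferingHandler extends DefaultHandler {

    /** A deferred SAX event that can be replayed against the wrapped handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final ContentHandler target;
    private final List<Event> queue = new ArrayList<>();
    private boolean flushed = false;

    HeadBufferingHandler(ContentHandler target) {
        this.target = target;
    }

    /** Replays everything queued so far, in arrival order, then stops queueing. */
    private void flush() throws SAXException {
        if (flushed) {
            return;
        }
        flushed = true;
        for (Event e : queue) {
            e.replay(target);
        }
        queue.clear();
    }

    /** Queues the event before the flush, passes it straight through afterwards. */
    private void emit(Event e) throws SAXException {
        if (flushed) {
            e.replay(target);
        } else {
            queue.add(e);
        }
    }

    @Override
    public void startDocument() throws SAXException {
        target.startDocument();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("body".equals(local)) {
            flush(); // trigger: startElement() for <body>
        }
        // Copy the attributes; SAX sources may reuse the Attributes object.
        Attributes copy = new AttributesImpl(atts);
        emit(t -> t.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        emit(t -> t.endElement(uri, local, qName));
        if ("head".equals(local)) {
            flush(); // trigger: endElement() for <head>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Copy the text; the caller's buffer is only valid during this call.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        emit(t -> t.characters(copy, 0, copy.length));
    }

    @Override
    public void endDocument() throws SAXException {
        flush(); // trigger: endDocument() with no <head> end or <body> start seen
        target.endDocument();
    }
}
```

With something like this in place, a <meta> that arrives before <title> would no longer force the head out early; nothing reaches the wrapped handler until one of the three triggers fires.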

-- Ken

On Aug 10, 2010, at 7:53pm, Ken Krugler wrote:

> Hi all,
> I was trying to debug why my fix for a problem with the Boilerpipe  
> integration wasn't working, and came across  
> XHTMLContentHandler.lazyStartDocument().
> This, when used by HtmlHandler, essentially skips calling the user- 
> provided content handler for the initial element tags (html, head,  
> body) until it looks like there's a reason to generate content. Then  
> it calls the content handler with no-attribute versions of these  
> elements, so attributes in elements like <html lang="en"> will get  
> stripped. Which seems like not a great thing, especially given  
> ongoing work to make it easier to pass everything through if that's  
> what's needed.
> But the problem I ran into was with this sequence:
> <html>
> 	<head>
> 		<title>xxx</title>
> 		<meta blah>
> 	</head>
> 	<body>
> 	...
> 	</body>
> </html>
> The problem is that this call to lazyStartDocument() is made when the  
> <meta> element is encountered. So what the content handler gets  
> called with is:
> <html>
> 	<head>
> 		<title>xxx</title>
> 	</head>
> 	<body>
> and then <meta>
> So the <meta> element is getting passed through after the <body>  
> element. And that in turn prevents Boilerpipe from behaving as  
> expected.
> But before I dive in here and start filing issues/hacking on the  
> code, I'm wondering if somebody (OK, Jukka) can provide some color  
> commentary.
> Thanks,
> -- Ken
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
