tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements
Date Thu, 12 Aug 2010 18:27:43 GMT

On Aug 12, 2010, at 12:43am, Jukka Zitting wrote:

> Hi,
>
> On Wed, Aug 11, 2010 at 4:53 AM, Ken Krugler
> <kkrugler_lists@transpac.com> wrote:
>> But before I dive in here and start filing issues/hacking on the code,
>> I'm wondering if somebody (OK, Jukka) can provide some color commentary.
>
> The rationale behind the lazy startup in XHTMLContentHandler is that
> many parsers don't yet have the document title metadata available when
> startDocument() is called. Instead of outputting an empty <title/>
> element, it's better to delay the startup to as late as possible.
>
> Now, more generally the contract of XHTMLContentHandler (see
> start/endDocument javadocs) is that the parser that feeds it should
> only output content that go *inside* the <body/> element. Feeding a
> full <html/> tree to an XHTMLContentHandler will cause trouble.

I think I'm missing something - which javadocs are you referring to
here? What I see for startDocument() is:

     /**
      * Starts an XHTML document by setting up the namespace mappings.
      * The standard XHTML prefix is generated lazily when the first
      * element is started.
      */

and for endDocument():

     /**
      * Ends the XHTML document by writing the following footer and
      * clearing the namespace mappings:
      * <pre>
      *   &lt;/body&gt;
      * &lt;/html&gt;
      * </pre>
      */
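
For anyone following along, here is a rough sketch of how I understand
the lazy startup to behave. It is not the Tika source: LazyStartSketch,
setTitle() and trace() are names made up for illustration, and it uses
only the JDK's org.xml.sax classes. The header is held back until the
first element or text arrives, so a title that is set after
startDocument() still makes it into <title/>.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the lazy startup idea; NOT the real XHTMLContentHandler.
class LazyStartSketch extends DefaultHandler {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    static final AttributesImpl NO_ATTRS = new AttributesImpl();

    private final ContentHandler out;
    private String title;          // may be set after startDocument()
    private boolean headerWritten; // true once <html><head>...<body> has been emitted

    LazyStartSketch(ContentHandler out) { this.out = out; }

    void setTitle(String title) { this.title = title; }

    @Override
    public void startDocument() throws SAXException {
        // Only the namespace mapping is set up here; the header is deferred.
        out.startDocument();
        out.startPrefixMapping("", XHTML);
    }

    // Emits the header once, on the first element or text.
    private void lazyStart() throws SAXException {
        if (headerWritten) return;
        headerWritten = true;
        out.startElement(XHTML, "html", "html", NO_ATTRS);
        out.startElement(XHTML, "head", "head", NO_ATTRS);
        out.startElement(XHTML, "title", "title", NO_ATTRS);
        if (title != null) out.characters(title.toCharArray(), 0, title.length());
        out.endElement(XHTML, "title", "title");
        out.endElement(XHTML, "head", "head");
        out.startElement(XHTML, "body", "body", NO_ATTRS);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        lazyStart();
        out.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        out.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        lazyStart();
        out.characters(ch, start, length);
    }

    @Override
    public void endDocument() throws SAXException {
        lazyStart(); // an empty document still gets a header
        out.endElement(XHTML, "body", "body");
        out.endElement(XHTML, "html", "html");
        out.endPrefixMapping("");
        out.endDocument();
    }

    // Feeds one empty element through the sketch and returns the events as a tag string.
    static String trace(String lateTitle, String element) {
        StringBuilder sb = new StringBuilder();
        ContentHandler log = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                sb.append('<').append(q).append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                sb.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int s, int n) {
                sb.append(ch, s, n);
            }
        };
        LazyStartSketch h = new LazyStartSketch(log);
        try {
            h.startDocument();
            h.setTitle(lateTitle); // title arrives after startDocument()
            h.startElement(XHTML, element, element, NO_ATTRS);
            h.endElement(XHTML, element, element);
            h.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }
}
```

With this sketch, trace("Late title", "p") yields
<html><head><title>Late title</title></head><body><p></p></body></html>,
while trace("T", "meta") puts the <meta> after <body>, since anything fed
to the handler is treated as body content.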

> If you have a parser that wants to output a full <html/> tree along
> with extra <meta/> entries inside the <head/> element, you can always
> directly use the ContentHandler instance given as an argument to the
> parse() method.

I've opened TIKA-478, though working through the complex SAX event
handling setup for HtmlParser has proven challenging.

Architecturally it feels like we need some major changes in the
HtmlParser code to reconcile two somewhat conflicting goals: producing
nice, normalized output, and passing more content through to the
user-provided content handler. Julien had proposed ways to let the
HtmlMapper do more of the heavy lifting, to allow for better external
control of processing, but that hasn't yet turned into a patch.

I saw your note on the issue in Jira:

> Oh, I see now where this problem with <meta/> elements is coming from.
>
> One reasonably clean way to solve this would be to disable the  
> output of <meta/> elements from HtmlHandler while keeping the code  
> that sets the respective Metadata entries. Then in  
> XHTMLContentHandler we'd modify the lazyStartDocument() method to  
> output not just the <title/> element but the full set of collected  
> metadata as <meta/> elements. We could also set the lang attribute  
> (or xml:lang?) of the <html/> element if the respective Metadata  
> entry is set.
>
> The nice thing about this solution would be that the inclusion of  
> metadata in <head/> would work also for other document types beyond  
> HTML.

This would work for <meta>, but not <link> or <base>.

I could add these as additional metadata, but that feels wrong.
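
To make the proposal quoted above concrete, here is a minimal sketch of
a header writer that emits collected metadata as <meta/> elements. The
names MetaHeadSketch, writeHead() and render() are hypothetical, and a
plain Map<String, String> stands in for Tika's Metadata class so the
example needs only the JDK. It also shows why <link> and <base> don't fit
well: a flat name/value entry maps naturally onto name= and content=,
but not onto attributes such as rel= or href=.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of the proposal; not code from Tika.
class MetaHeadSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Writes <head> with a <title> and one <meta name=... content=...> per map entry.
    static void writeHead(ContentHandler out, String title, Map<String, String> metadata)
            throws SAXException {
        out.startElement(XHTML, "head", "head", new AttributesImpl());
        out.startElement(XHTML, "title", "title", new AttributesImpl());
        if (title != null) out.characters(title.toCharArray(), 0, title.length());
        out.endElement(XHTML, "title", "title");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "name", "name", "CDATA", e.getKey());
            atts.addAttribute("", "content", "content", "CDATA", e.getValue());
            out.startElement(XHTML, "meta", "meta", atts);
            out.endElement(XHTML, "meta", "meta");
        }
        out.endElement(XHTML, "head", "head");
    }

    // Renders the SAX events as a tag string (no escaping; for illustration only).
    static String render(String title, Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        ContentHandler log = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                sb.append('<').append(q);
                for (int i = 0; i < a.getLength(); i++) {
                    sb.append(' ').append(a.getQName(i))
                      .append("=\"").append(a.getValue(i)).append('"');
                }
                sb.append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                sb.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int s, int n) {
                sb.append(ch, s, n);
            }
        };
        try {
            writeHead(log, title, metadata);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }
}
```

Calling render("T", map) with a single entry Content-Language=en gives
<head><title>T</title><meta name="Content-Language" content="en"></meta></head>.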

In the short term, since this is a blocker for a project I'm working  
on, I plan to slightly modify XHTMLContentHandler to allow it to work  
properly with <head> elements (specifically, meta/link/base).

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




