tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements
Date Fri, 13 Aug 2010 08:34:50 GMT

On Thu, Aug 12, 2010 at 8:27 PM, Ken Krugler
<kkrugler_lists@transpac.com> wrote:
> I think I'm missing something - which javadocs are you referring to here?
> What I see for startDocument() is:
>    /**
>     * Starts an XHTML document by setting up the namespace mappings.
>     * The standard XHTML prefix is generated lazily when the first
>     * element is started.
>     */

I guess the "standard XHTML prefix" is a bit vague here... Mea culpa.
The intention was that XHTMLContentHandler would provide everything up
to the opening <body> tag when startDocument() is called.
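
To make that intention concrete, here is a rough sketch. It is not the
actual XHTMLContentHandler code: EagerPrefixHandler and record() are
made-up names, and the exact contents of <head> (just a <title> here) are
an assumption. It only shows everything up to the opening <body> tag being
emitted eagerly from startDocument(), using the plain SAX API.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class EagerPrefixSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical wrapper; NOT Tika's XHTMLContentHandler. */
    static class EagerPrefixHandler {
        private final ContentHandler out;

        EagerPrefixHandler(ContentHandler out) {
            this.out = out;
        }

        /** Emits everything up to and including the opening body tag. */
        void startDocument(String title) throws SAXException {
            out.startDocument();
            out.startPrefixMapping("", XHTML);
            start("html");
            start("head");
            start("title");
            out.characters(title.toCharArray(), 0, title.length());
            end("title");
            end("head");
            start("body");
        }

        private void start(String name) throws SAXException {
            out.startElement(XHTML, name, name, new AttributesImpl());
        }

        private void end(String name) throws SAXException {
            out.endElement(XHTML, name, name);
        }
    }

    /** Records the element events produced by startDocument(), in order. */
    static List<String> record(String title) {
        final List<String> events = new ArrayList<>();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                events.add("<" + local + ">");
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("</" + local + ">");
            }
        };
        try {
            new EagerPrefixHandler(recorder).startDocument(title);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }
}
```

With this shape, a parser that wants to add its own <meta>, <link> or <base>
elements has no chance to do so, because <head> is already closed by the
time startDocument() returns. That is the limitation under discussion.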

> I saw your note on the issue in Jira:
> [...]
> This would work for <meta>, but not <link> or <base>.

I'd argue that we shouldn't output the <base> element. Instead we
should normalize all URLs before giving them out to the client.
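
A minimal sketch of what I mean by normalizing, assuming the base URL is
already known (from the document location or from a <base href> seen by the
parser). UrlNormalizationSketch and resolve() are hypothetical names, not
existing Tika API; the only real call used is java.net.URI.resolve().

```java
import java.net.URI;

class UrlNormalizationSketch {
    /**
     * Resolves a possibly relative href against the given base URL.
     * Hypothetical helper; absolute hrefs are returned unchanged.
     */
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }
}
```

For example, resolve("http://example.com/docs/index.html", "../img/logo.png")
gives "http://example.com/img/logo.png", so the client never needs to see
the <base> element to interpret the link.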

I agree with your point about <link> though. My solution doesn't
address that case.

> In the short term, since this is a blocker for a project I'm working on, I
> plan to slightly modify XHTMLContentHandler to allow it to work properly
> with <head> elements (specifically, meta/link/base).

Go for it! You're the one with the itch and the cycles to implement a
solution, so in the end it's your call on how to do this.


Jukka Zitting
