tika-dev mailing list archives

From "Niall Pemberton" <niall.pember...@gmail.com>
Subject Re: How is WriteOutContentHandler supposed to work?
Date Tue, 20 Nov 2007 13:14:43 GMT
On Nov 20, 2007 12:52 PM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
>
> On Nov 20, 2007 9:14 AM, Niall Pemberton <niall.pemberton@gmail.com> wrote:
> > Apologies if this is a stupid question, but I don't understand
> > WriteOutContentHandler[1] - shouldn't it be implementing the
> > startElement(), endElement() etc. methods?
>
> There are a lot of use cases where a client is only interested in the
> plain text content of the document without any of the structuring
> encoded in the XHTML SAX events generated by a parser. The
> WriteOutContentHandler was designed to support those use cases as a
> simple and fast way to translate the SAX event stream to a character
> stream that only contains text from the parsed document.
>
> You can use a standard SAX TransformerHandler if you want to serialize
> the full generated XHTML document.
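
For reference, a minimal sketch of that TransformerHandler approach using
only the JDK's javax.xml.transform API. The class name SerializeDemo and the
hand-fired SAX events are assumptions for illustration; in real use the
handler would be passed to the parser instead.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, not part of Tika.
class SerializeDemo {

    static String serialize() throws Exception {
        // The JDK's default TransformerFactory supports the SAX extensions.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();

        // An identity TransformerHandler serializes the SAX events it receives.
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Stand-in for the events a parser would generate.
        char[] text = "Hello".toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize());
    }
}
```

Unlike a text-only handler, this keeps the element markup in the output.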

OK, thanks. Is the document's title supposed to be written, then? If it
is, why not the rest of the metadata? Also, there's no separation
between the title and the start of the content, which looks like a bug.
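
To illustrate the missing separation, here is a minimal sketch of a handler
that forwards only characters() events to a Writer. It is not the actual
WriteOutContentHandler source; TextOnlyDemo and TextOnlyHandler are
hypothetical names, and the JDK's SAX parser stands in for a Tika parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Tika.
class TextOnlyDemo {

    // Ignores startElement()/endElement() and writes character data only.
    static class TextOnlyHandler extends DefaultHandler {
        private final Writer writer;

        TextOnlyHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    static String extract(String xhtml) throws Exception {
        StringWriter out = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new TextOnlyHandler(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><title>My Title</title></head>"
                + "<body><p>Body text</p></body></html>";
        // The title and body text run together: "My TitleBody text"
        System.out.println(extract(doc));
    }
}
```

Because element boundaries are dropped, nothing marks where the title ends
and the body begins unless the handler or the parser emits whitespace there.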

Niall

> BR,
>
> Jukka Zitting
