tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: How is WriteOutContentHandler supposed to work?
Date Tue, 20 Nov 2007 12:52:59 GMT
Hi,

On Nov 20, 2007 9:14 AM, Niall Pemberton <niall.pemberton@gmail.com> wrote:
> Apologies if this is a stupid question, but I don't understand
> WriteOutContentHandler[1] - shouldn't it be implementing the
> startElement(), endElement() etc. methods?

There are many use cases where a client is only interested in the
plain text content of a document, without any of the structure
encoded in the XHTML SAX events generated by a parser. The
WriteOutContentHandler was designed to support those use cases as a
simple and fast way to translate the SAX event stream into a character
stream that contains only the text of the parsed document.
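
To illustrate the idea, here is a minimal sketch of a text-only handler.
It is not the actual Tika source: the class name TextOnlyHandler and the
demo() method are hypothetical, and the SAX events are fired by hand to
stand in for a parser. Only characters() is overridden, so the element
events fall through to the no-op defaults in DefaultHandler.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the idea behind WriteOutContentHandler;
// this is an illustration, not Tika's implementation.
class TextOnlyHandler extends DefaultHandler {

    private final Writer writer;

    TextOnlyHandler(Writer writer) {
        this.writer = writer;
    }

    // Only character data is forwarded. startElement() and endElement()
    // are inherited as no-ops, so all markup is dropped.
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        try {
            writer.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Error writing text", e);
        }
    }

    // Fires a few hand-written SAX events in place of a real parser
    // and returns whatever text reached the writer.
    static String demo() throws SAXException {
        StringWriter out = new StringWriter();
        TextOnlyHandler handler = new TextOnlyHandler(out);
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] text = "Hello, Tika".toCharArray();

        handler.startDocument();
        handler.startElement(xhtml, "p", "p", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement(xhtml, "p", "p");
        handler.endDocument();
        return out.toString();
    }
}
```

Calling TextOnlyHandler.demo() returns just the character data passed to
characters(), with the surrounding <p> element discarded.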

You can use a standard SAX TransformerHandler if you want to serialize
the full generated XHTML document.
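
A rough sketch of that approach, using only the JAXP classes that ship
with the JDK. The class name XhtmlSerializerSketch and the demo() method
are hypothetical, the SAX events are fired by hand in place of a parser,
and the element is left in no namespace to keep the example short (a
real XHTML event stream would carry the XHTML namespace).

```java
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.helpers.AttributesImpl;

// Hypothetical example class; it only shows how a TransformerHandler
// can serialize a SAX event stream, markup included.
class XhtmlSerializerSketch {

    static String demo() throws Exception {
        // The default JDK factory supports SAX, so the cast is the usual idiom.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();

        // With no stylesheet this is an identity transform:
        // SAX events in, serialized markup out.
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Hand-written events standing in for a parser's output.
        char[] text = "Hello, Tika".toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }
}
```

Unlike a text-only handler, the string returned here keeps the element
markup around the text.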

BR,

Jukka Zitting
