tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: XHTML Bean and corresponding content handler
Date Thu, 06 Aug 2009 15:06:35 GMT

On Tue, Aug 4, 2009 at 9:30 AM, Michael Wechner
<michael.wechner@wyona.com> wrote:
> String XHTMLBean.getHead().getMeta(XHTMLBean.DESCRIPTION)
> String XHTMLBean.getHead().getTitle()

You can get both of these from the Metadata object that you pass to
the parser.

> String[] XHTMLBean.getBody().getParagraphs();

This one is a bit troublesome, as not all parsers produce paragraphs of
content. For example, the Excel parser produces XHTML tables.

You can either get just the plain character stream using tools like
BodyContentHandler, or the full XHTML output as SAX events (which you
can serialize to a byte stream if you want). I'm not sure if there's
any reasonable intermediate content abstraction.
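
To make the two options concrete, here is a rough sketch that uses only
the JDK's SAX and JAXP classes. It is not Tika code: emitTable() is a
hypothetical stand-in for a parser that reports a spreadsheet as an
XHTML table, and the anonymous collector only approximates what a
text-only handler like BodyContentHandler keeps. The class and method
names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SaxConsumersSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in for a parser: emits a one-row XHTML table as SAX events. */
    static void emitTable(ContentHandler h) throws SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        h.startDocument();
        h.startPrefixMapping("", XHTML);
        h.startElement(XHTML, "html", "html", noAttrs);
        h.startElement(XHTML, "body", "body", noAttrs);
        h.startElement(XHTML, "table", "table", noAttrs);
        h.startElement(XHTML, "tr", "tr", noAttrs);
        for (String cell : new String[] {"Q1", "42"}) {
            h.startElement(XHTML, "td", "td", noAttrs);
            h.characters(cell.toCharArray(), 0, cell.length());
            h.endElement(XHTML, "td", "td");
        }
        h.endElement(XHTML, "tr", "tr");
        h.endElement(XHTML, "table", "table");
        h.endElement(XHTML, "body", "body");
        h.endElement(XHTML, "html", "html");
        h.endPrefixMapping("");
        h.endDocument();
    }

    /** Option 1: keep only the character data and drop the markup. */
    public static String plainText() {
        StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("td".equals(localName)) {
                    text.append('\t'); // keep neighbouring cells apart
                }
            }
        };
        try {
            emitTable(collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString().trim();
    }

    /** Option 2: serialize the same SAX events to a byte stream as XML. */
    public static byte[] xhtmlBytes() {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            // With no stylesheet, the handler performs an identity transform.
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            serializer.setResult(new StreamResult(out));
            emitTable(serializer);
            return out.toByteArray();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(plainText());
        System.out.println(new String(xhtmlBytes(), StandardCharsets.UTF_8));
    }
}
```

With a real parser you would pass the collector or the serializer as
the ContentHandler argument instead of calling emitTable(). Note that
the table structure survives only in the second form, which is why a
getParagraphs()-style accessor has nothing sensible to return for
input like this.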


Jukka Zitting
