tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: ContentHandler's OutputStream
Date Thu, 05 Feb 2009 09:22:34 GMT

On Thu, Feb 5, 2009 at 3:02 AM, Jonathan Koren <jonathan@soe.ucsc.edu> wrote:
> What I really want is someone to tell me how to get back a usable stream of
> plaintext, whether this involves a radical change to Tika's ContentHandler
> class or some trick with Java, I really don't care, as long as it's single
> thread safe.

Have you looked at the ParsingReader class? It seems like a perfect
match for your needs. ParsingReader starts a background thread to do
the parsing and pipes the output to you, so you control when and how
you read the extracted text.
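
To illustrate the thread-plus-pipe idea, here is a minimal sketch that
uses only JDK classes. It is not Tika's implementation: PipedTextSketch
and its methods are hypothetical names, and the "parser" is a stand-in
that just emits the strings it is given.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the idea: a producer thread writes text into a
// pipe while the caller reads from the other end at its own pace.
class PipedTextSketch {

    // Returns a Reader whose content is produced by a background thread.
    // The fake "parser" only emits the given chunks; a real parser would
    // write extracted text as it encounters it.
    static Reader open(String... chunks) {
        try {
            PipedReader reader = new PipedReader();
            PipedWriter writer = new PipedWriter(reader);
            Thread producer = new Thread(() -> {
                // Closing the writer signals end-of-stream to the reader.
                try (PipedWriter out = writer) {
                    for (String chunk : chunks) {
                        out.write(chunk);
                    }
                } catch (IOException e) {
                    // The consumer closed the reader early; nothing left to do.
                }
            }, "text-producer");
            producer.setDaemon(true);
            producer.start();
            return reader;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains a Reader into a String, the way a caller might consume the pipe.
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(reader)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(readAll(open("Hello, ", "world")));
    }
}
```

With ParsingReader itself you would skip all of this and just wrap your
input stream in it, then read from it like any other Reader.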

Alternatively, if the extra thread is not acceptable, you could
implement a custom ContentHandler that directly catches and processes
the characters() and ignorableWhitespace() events.
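
A rough sketch of such a handler, under some assumptions: the class name
TextCallbackHandler and the Consumer-based sink are made up for
illustration, and the extract() helper drives the handler with the JDK's
own SAX parser on a small XML string only so the example runs without
Tika. With Tika you would pass the handler to the parser instead.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards every piece of text to a callback on the
// same thread that runs the parser, so no extra thread is involved.
class TextCallbackHandler extends DefaultHandler {

    private final Consumer<String> sink;

    TextCallbackHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the slice; the parser may reuse the array after this call.
        sink.accept(new String(ch, start, length));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        sink.accept(new String(ch, start, length));
    }

    // Demo driver (an assumption for this sketch): parses an XML string with
    // the JDK SAX parser and returns the concatenated text content.
    static String extract(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new TextCallbackHandler(text::append));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello <b>world</b></p>"));
    }
}
```

The catch is that processing happens inside the callbacks, so the parser
drives your code rather than the other way around.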

Or you could subclass Writer and treat the write() calls as callbacks
from the parser.
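
Something along these lines, as a sketch only: CallbackWriter and its
Consumer parameter are hypothetical names. To hook it up to Tika I
believe you can wrap the Writer in a BodyContentHandler, but check the
javadoc for the version you are using.

```java
import java.io.Writer;
import java.util.function.Consumer;

// Hypothetical Writer that does no buffering of its own: each write() is
// handed straight to a callback on the calling (parsing) thread.
class CallbackWriter extends Writer {

    private final Consumer<String> callback;

    CallbackWriter(Consumer<String> callback) {
        this.callback = callback;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        if (len > 0) {
            // Copy the slice, since the caller may reuse the array.
            callback.accept(new String(cbuf, off, len));
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered, so there is nothing to flush.
    }

    @Override
    public void close() {
        // No underlying resource to release.
    }

    public static void main(String[] args) {
        CallbackWriter writer = new CallbackWriter(System.out::print);
        char[] chunk = "extracted text\n".toCharArray();
        writer.write(chunk, 0, chunk.length);
    }
}
```

The other write() overloads inherited from Writer all end up in this
three-argument method, so it is the only one that needs overriding.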


Jukka Zitting
