tika-dev mailing list archives

From Jonathan Koren <jonat...@soe.ucsc.edu>
Subject ContentHandler's OutputStream
Date Thu, 05 Feb 2009 02:02:02 GMT
Let me preface my remarks by saying that I'm mystified as to how to use
ContentHandler to do anything complicated.

It seems like the semantics of getting the content out of a
ContentHandler are wrong, or at least shortsighted.  The user has two
options for using the text provided by ContentHandler.  The user
can provide an OutputStream, to which ContentHandler will write() the
bytes as it reads the InputStream associated with the file, or
the user can have ContentHandler buffer the entire parsed contents of
the file in memory and then get back a humongous String via
ContentHandler.toString().

There needs to be a better way.

Writing the bytes to an OutputStream pretty much locks the bytes up so
that the only thing you can do is write them to some sort of device,
whether it's the console, disk, or a network connection.  Buffering
the entire file is simply not an option for very large files.  For
very large files, you need to process chunks of the file, like from a
stream, or better yet, a series of callbacks with a relatively small
buffer (say, even a few megs).  (This is how SAX does it.)  By using a
callback system, the user is free to do whatever he/she wants to do
with each chunk.  If he/she wants to blast it to the disk, a simple
OutputStream.write(buf) is good enough.  If he/she wants to do some more
parsing of the text (like I want to do), then he/she can do that as well
without reading the entire file into memory.
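
A minimal sketch of the callback style described above, assuming the
text arrives through the characters() method of the SAX
org.xml.sax.ContentHandler interface.  ChunkHandler and its
Consumer<String> callback are hypothetical names, and the JDK's own SAX
parser stands in for a Tika parser here so that the example runs without
Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards each chunk of character data to a callback
// instead of writing it to an OutputStream or buffering the whole document.
class ChunkHandler extends DefaultHandler {
    private final Consumer<String> sink;

    ChunkHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser owns ch and may reuse it, so copy the slice before passing it on.
        sink.accept(new String(ch, start, length));
    }

    // Demo helper: parse an XML string with the JDK SAX parser and collect the chunks.
    // The parser decides where chunk boundaries fall, so callers should not rely on them.
    static List<String> chunksOf(String xml) {
        List<String> chunks = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new ChunkHandler(chunks::add));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String chunk : chunksOf("<doc><p>First part.</p><p>Second part.</p></doc>")) {
            System.out.println(chunk);
        }
    }
}
```

The callback can write the chunk to a stream, feed it to another parser,
or discard it; nothing in the handler itself keeps more than one chunk
alive at a time.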

Here's my scenario that prompted this email:

I'm reading a bunch of files of a variety of types.  Some of these
files can be quite large.  Like gigabytes.  I'm using AutoDetectParser
to handle the appropriate parsing and BodyContentHandler to extract
the plaintext.  I want to take the extracted plaintext, do some
analysis on it, and then index the plaintext along with the results of my
analysis.  Specifically, my analysis requires taking the extracted
plaintext, segmenting it into sentences, and doing part-of-speech
tagging and morphological analysis (i.e. stemming) via an external
process.  This means I can't use an OutputStream, since you can't read
from an OutputStream, so I'm stuck with using
ContentHandler.toString(), which can (and does) exhaust memory for
large files.
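
To make the scenario concrete, here is a rough sketch of sentence
segmentation over text that arrives in arbitrary chunks, using the JDK's
java.text.BreakIterator.  SentenceChunker is a hypothetical helper, not
part of Tika; the assumption is that something calls accept() with each
chunk of extracted text as it becomes available, and that English
sentence rules are good enough for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical helper: takes text in arbitrary chunks and emits whole sentences,
// keeping only the unfinished tail of the text in memory.
class SentenceChunker {
    private final StringBuilder pending = new StringBuilder();
    private final Consumer<String> onSentence;

    SentenceChunker(Consumer<String> onSentence) {
        this.onSentence = onSentence;
    }

    // Feed one chunk of extracted text; complete sentences are emitted immediately.
    void accept(String chunk) {
        pending.append(chunk);
        String text = pending.toString();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        int consumed = 0;
        // The segment that reaches the end of the buffer may be cut off
        // mid-sentence, so hold it back until more text arrives.
        while (end != BreakIterator.DONE && end < text.length()) {
            emit(text.substring(start, end));
            consumed = end;
            start = end;
            end = it.next();
        }
        pending.delete(0, consumed);
    }

    // Call once at end of input to flush whatever is left.
    void finish() {
        emit(pending.toString());
        pending.setLength(0);
    }

    private void emit(String sentence) {
        String trimmed = sentence.trim();
        if (!trimmed.isEmpty()) {
            onSentence.accept(trimmed);
        }
    }

    // Demo helper: run the given chunks through a chunker and collect the sentences.
    static List<String> split(String... chunks) {
        List<String> sentences = new ArrayList<>();
        SentenceChunker chunker = new SentenceChunker(sentences::add);
        for (String chunk : chunks) {
            chunker.accept(chunk);
        }
        chunker.finish();
        return sentences;
    }

    public static void main(String[] args) {
        split("Hello wor", "ld. How are", " you? Fine.").forEach(System.out::println);
    }
}
```

Memory use is bounded by the longest run of text without a sentence
break, not by the file size, so a document with no sentence boundaries
at all would still grow the buffer.  Each emitted sentence could then be
handed to the external tagger.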

What I really want is for someone to tell me how to get back a usable
stream of plaintext.  Whether this involves a radical change to Tika's
ContentHandler class or some trick with Java, I really don't care, as
long as it's single-thread safe.  (Java's PipedInputStream and
PipedOutputStream are not single-thread safe; using both ends from one
thread can deadlock.)

I know I can't be the only one who's had or will have this problem.  It
really seems like this use case needs to be handled, because the use
case that Tika currently seems to be designed for is "Write plaintext
to the disk."


Jonathan Koren
