tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: TIKA - how to read chunks at a time from a very large file?
Date Thu, 28 Aug 2014 18:49:32 GMT
Probably better question for the user list.

Extending a ContentHandler and using that in ContentHandlerDecorator is pretty straightforward.
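A minimal sketch of that idea, using only the JDK's SAX classes (Tika content handlers implement the same org.xml.sax.ContentHandler interface; with Tika you would wrap this logic in a ContentHandlerDecorator and pass it to Parser.parse()). The class name and callback here are hypothetical:

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.util.function.Consumer;

// Hypothetical sketch: buffer character events and hand off
// fixed-size chunks as they fill, instead of accumulating the
// whole document in memory.
class ChunkingHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> chunkConsumer;
    private final StringBuilder buffer = new StringBuilder();

    ChunkingHandler(int chunkSize, Consumer<String> chunkConsumer) {
        this.chunkSize = chunkSize;
        this.chunkConsumer = chunkConsumer;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        buffer.append(ch, start, length);
        // emit full chunks as soon as they are available
        while (buffer.length() >= chunkSize) {
            chunkConsumer.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (buffer.length() > 0) {   // flush whatever is left
            chunkConsumer.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

Each chunk is handed to the consumer as soon as it fills, so memory use stays bounded by chunkSize regardless of document size.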

Would it be easy enough to write to a file by passing an OutputStream to WriteOutContentHandler?
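Roughly like this: give the handler a Writer (or OutputStream) that streams straight to disk, so the extracted text never accumulates in memory. The helper below is a hypothetical stand-in that shows only the JDK streaming path; with Tika the writer would go to new WriteOutContentHandler(writer) in place of the direct write:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: stream extracted text to a file instead of
// building a String with handler.toString(). With Tika, you would
// pass the writer to new WriteOutContentHandler(writer) and hand
// that handler to parser.parse(stream, handler, meta, context).
class ExtractToFile {
    static Path extract(String extractedText, Path out) throws IOException {
        try (Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            // stand-in for the parse call writing through the handler
            writer.write(extractedText);
        }
        return out;
    }
}
```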

-----Original Message-----
From: ruby [mailto:rshossain@gmail.com] 
Sent: Thursday, August 28, 2014 2:07 PM
To: tika-dev@lucene.apache.org
Subject: TIKA - how to read chunks at a time from a very large file?

Using a ContentHandler, is there a way to read chunks at a time from a very
large file (over 5 GB)? Right now I'm doing the following to read the entire
content at once:

InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();
p.parse(stream, handler, meta, context);
String content = handler.toString();

Since the files contain over 5 GB of data, the content string here ends up
holding too much data in memory. I want to avoid this and read a chunk at a
time instead.

I tried ParsingReader and I can read chunks with it, but the chunks split
mid-word. Some of the files contain Chinese/Japanese text, so we can't split
on whitespace either.
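One way around the word-splitting problem: pull fixed-size character chunks from the Reader (ParsingReader is just a Reader), rather than splitting on whitespace, which works the same for Chinese/Japanese text. This helper is a hypothetical sketch; note a cut can still land between the two halves of a surrogate pair, so widen a chunk by one char in that case if it matters downstream:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: read fixed-size char chunks from any Reader
// (e.g. Tika's ParsingReader), independent of word boundaries.
class CharChunker {
    static List<String> chunks(Reader reader, int chunkSize) throws IOException {
        List<String> out = new ArrayList<>();
        char[] buf = new char[chunkSize];
        while (true) {
            int filled = 0, n;
            // loop to fill, since a Reader may return short reads
            while (filled < chunkSize
                    && (n = reader.read(buf, filled, chunkSize - filled)) != -1) {
                filled += n;
            }
            if (filled == 0) break;               // stream exhausted
            out.add(new String(buf, 0, filled));
            if (filled < chunkSize) break;        // stream ended mid-chunk
        }
        return out;
    }
}
```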

View this message in context: http://lucene.472066.n3.nabble.com/TIKA-how-to-read-chunks-at-a-time-from-a-very-large-file-tp4155644.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
