tika-dev mailing list archives

From ruby <rshoss...@gmail.com>
Subject TIKA - how to read chunks at a time from a very large file?
Date Thu, 28 Aug 2014 18:06:54 GMT
Using a ContentHandler, is there a way to read a very large file (over 5 GB) in chunks? Right now I'm reading the entire content at once like this:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.WriteOutContentHandler;

InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
WriteOutContentHandler handler = new WriteOutContentHandler(-1); // -1 disables the write limit
ParseContext context = new ParseContext();
p.parse(stream, handler, meta, context);
String content = handler.toString();

Since the files contain over 5 GB of data, the content string here ends up holding too much data in memory. I want to avoid this and read a chunk at a time instead.

I tried ParsingReader, and I can read chunks with it, but the chunks split words apart. Some of the files contain Chinese/Japanese text, so we can't fall back on splitting at white-space either.
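For what it's worth, since Tika delivers extracted text to the ContentHandler as SAX character events, one option is a handler that flushes fixed-size chunks as the text arrives instead of buffering the whole document. A minimal sketch using only JDK SAX types (ChunkingContentHandler, emit, and the chunk size are my own hypothetical names, not Tika API); you would pass an instance of it to p.parse(...) in place of WriteOutContentHandler:

```java
import org.xml.sax.helpers.DefaultHandler;
import java.util.ArrayList;
import java.util.List;

// Buffers SAX character events and hands off fixed-size chunks,
// so memory use stays bounded regardless of document size.
class ChunkingContentHandler extends DefaultHandler {
    private final int chunkSize;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> chunks = new ArrayList<>(); // stand-in for real per-chunk processing

    ChunkingContentHandler(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Flush every full chunk; keep only the remainder buffered.
        while (buffer.length() >= chunkSize) {
            emit(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left at the end of the parse.
        if (buffer.length() > 0) {
            emit(buffer.toString());
            buffer.setLength(0);
        }
    }

    private void emit(String chunk) {
        chunks.add(chunk); // replace with your indexing/writing logic
    }

    List<String> getChunks() {
        return chunks;
    }
}
```

Fixed-size character chunks will still split words, of course; to avoid that for CJK text, the emit boundary could be moved back to the nearest sentence or line boundary found with java.text.BreakIterator instead of cutting at a fixed offset.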

--
View this message in context: http://lucene.472066.n3.nabble.com/TIKA-how-to-read-chunks-at-a-time-from-a-very-large-file-tp4155644.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
