tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: TIKA - how to read chunks at a time from a very large file?
Date Fri, 29 Aug 2014 13:52:19 GMT
My assumption in making that recommendation was that a given document wouldn't split a word across
an "element".  I can, of course, think of exceptions (a word broken at the end of a PDF page,
for example), but generally this shouldn't happen very often.  However,
if it does happen often with your documents, or if a single element is too large to hold
in memory, then that recommendation won't work, and you'll probably have to write to disk.
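To illustrate the idea being discussed, here is a minimal sketch of a SAX ContentHandler that buffers text and only emits a chunk at a whitespace boundary, so words are not split mid-chunk. The class and method names (ChunkingHandler, flushChunk) are hypothetical, not part of Tika's API; a Tika parser would simply be handed this handler in place of, say, a BodyContentHandler.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: buffer characters() callbacks and cut chunks
// only at whitespace, so a word is never split across two chunks.
public class ChunkingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int chunkSize;                    // soft upper bound per chunk
    final List<String> chunks = new ArrayList<>();  // stand-in for a real sink

    public ChunkingHandler(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        while (buffer.length() >= chunkSize) {
            // back up from the soft limit to the last whitespace
            int cut = chunkSize;
            while (cut > 0 && !Character.isWhitespace(buffer.charAt(cut - 1))) {
                cut--;
            }
            if (cut == 0) {
                cut = chunkSize;  // one giant token: forced to split anyway
            }
            flushChunk(cut);
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {
            flushChunk(buffer.length());  // flush whatever remains
        }
    }

    private void flushChunk(int upTo) {
        // In a real pipeline this would write to disk or an index
        // instead of accumulating in memory.
        chunks.add(buffer.substring(0, upTo));
        buffer.delete(0, upTo);
    }
}
```

Note the forced-split fallback: as the reply above says, if a single unbroken token (or element) is larger than the buffer, splitting somewhere is unavoidable.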

From: ruby [rshossain@gmail.com]
Sent: Thursday, August 28, 2014 3:26 PM
To: tika-dev@lucene.apache.org
Subject: Re: TIKA - how to read chunks at a time from a very large file?

If I extend the ContentHandler, then is there a way to make sure that I don't
split on words?
