My assumption in making that recommendation was that a given document wouldn't split a word
across an "element". I can, of course, think of exceptions (a word broken at the end of a PDF
page, for example), but generally this shouldn't happen very often. However, if it does happen
often with your documents, or if a single element is too large to hold in memory, then that
recommendation won't work, and you'll probably have to write to disk.
________________________________________
From: ruby [rshossain@gmail.com]
Sent: Thursday, August 28, 2014 3:26 PM
To: tika-dev@lucene.apache.org
Subject: Re: TIKA - how to read chunks at a time from a very large file?
If I extend the ContentHandler, is there a way to make sure that I don't
split in the middle of a word?