tika-dev mailing list archives

From "Jana, Kumar Raja" <kj...@ptc.com>
Subject Limiting the extracted content
Date Mon, 28 Jun 2010 13:49:58 GMT

We use Apache Tika in our application to extract content before sending
it to Solr for indexing. Some of our documents are quite large (over
150 MB in size, with over 30 MB of "text only" content). Processing such
documents often results in out-of-memory errors at runtime. Of course,
increasing the maximum heap size does resolve this issue, and another
option we use is to index in chunks of 5 MB.


On careful analysis, we realized that most of our keywords lie in the
first 1-2 MB of such documents, and indexing that chunk alone is enough
for our requirement. Is there any provision in the Tika API to extract
only the first 1 or 2 MB (customizable) of the content instead of
parsing the entire document? If not, can someone point me to the part of
the code I can modify to implement this?
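
To make the request concrete, here is a minimal sketch of the behaviour
being asked for. It relies on the fact that Tika parsers report extracted
text as SAX events to a ContentHandler, so a handler that stops after a
fixed number of characters can abort the parse early. Everything named
here (LimitedTextHandler, LimitReachedException, extract) is hypothetical
and not part of Tika, and the JDK's own SAX parser stands in for a Tika
parser so that the example runs without extra dependencies.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): keeps at most `limit` characters
// of text and aborts the parse once that many have been collected.
class LimitedTextHandler extends DefaultHandler {

    // Marker exception so callers can tell "limit reached" apart from a
    // genuine parse failure.
    static class LimitReachedException extends SAXException {
        LimitReachedException() {
            super("character limit reached");
        }
    }

    private final int limit;
    private final StringBuilder text = new StringBuilder();

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int remaining = limit - text.length();
        if (length >= remaining) {
            // Keep only what still fits, then stop the parser.
            text.append(ch, start, remaining);
            throw new LimitReachedException();
        }
        text.append(ch, start, length);
    }

    // Parses `xml` and returns at most `limit` characters of its text content.
    static String extract(String xml, int limit) throws Exception {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (LimitReachedException e) {
            // Expected: the handler stopped the parse on purpose.
        }
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<d><p>hello</p><p>world</p></d>";
        System.out.println(extract(xml, 7)); // prints "hellowo"
    }
}
```

With Tika, the same handler would presumably be passed to the parser's
parse method in place of the usual body handler, with the limit set to
1-2 million characters. The parser still has to read the input up to the
point where the limit is hit, so this caps the size of the extracted text
rather than guaranteeing a bound on the parser's own memory use.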



