tika-dev mailing list archives

From Chris Mattmann <mattm...@apache.org>
Subject Re: [EXTERNAL] Tika Python questions
Date Tue, 08 Oct 2019 16:45:56 GMT


Thanks for your question. Yes, the same way you set the byte size property in Tika-App (I think through parser configuration) is how you would do it for Tika-Server. You would just start the Tika Server yourself with a custom config file that sets this property, and start it on the default port (making sure any other servers on that port were killed first). Then Tika-Python will use your own Tika Server with the custom config.
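
A minimal sketch of the client side of this setup, assuming a Tika Server has already been started by hand on the default port 9998. The helper names below are hypothetical; `TIKA_CLIENT_ONLY`, `TIKA_SERVER_ENDPOINT` and `parser.from_file(..., serverEndpoint=...)` are taken from tika-python as I understand it, so check them against the version you have installed. The server command in the comment uses placeholder file names.

```python
import os

# Assumption: the server was started manually with a custom config, e.g.
#   java -jar tika-server.jar --config my-tika-config.xml
# (jar name and config path are placeholders, not real file names).
DEFAULT_ENDPOINT = "http://localhost:9998"  # Tika Server's default port


def use_existing_server(endpoint=DEFAULT_ENDPOINT):
    """Ask tika-python to act as a client only and not spawn its own server."""
    # tika-python reads these variables when it is first imported,
    # so they have to be set before any `import tika`.
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint
    return endpoint


def parse_with_existing_server(path, endpoint=DEFAULT_ENDPOINT):
    """Parse `path` through the already-running, custom-configured server."""
    use_existing_server(endpoint)
    from tika import parser  # imported late so the settings above apply
    return parser.from_file(path, serverEndpoint=endpoint)
```

Calling `parse_with_existing_server("big.xlsx")` (a placeholder file name) would then send the file to your own server instead of one launched by the library.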


As for catching errors, Tika-Python will try its best to do that, but it does not catch all of them. If you find something it doesn’t catch, let us know and we will work to fix it.
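
One way to check for failures on the caller's side is to inspect the dictionary that `parser.from_file` returns. The sketch below is not part of Tika-Python; `classify_parse_result` is a hypothetical helper, and it assumes the result carries a `status` key holding the HTTP status from Tika Server (200 on success, 422 when the server could not process the document, for example an encrypted or unsupported file) and a `content` key that may be missing or empty.

```python
def classify_parse_result(parsed):
    """Map a tika-python style result dict to a coarse outcome label."""
    status = parsed.get("status")
    if status is None:
        # No status at all: the server probably did not answer.
        return "no_response"
    if status == 200:
        content = parsed.get("content")
        # A successful call can still yield no extractable text.
        return "ok" if content and content.strip() else "empty"
    if status == 422:
        # Tika Server reports unprocessable documents this way.
        return "unprocessable"
    return "error"


# Usage, assuming a running server and a placeholder file name:
#   from tika import parser
#   outcome = classify_parse_result(parser.from_file("report.xlsx"))
```

Network-level failures (server down, timeouts) would still surface as exceptions and need an ordinary `try`/`except` around the call.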

From: "hans.meijer@avident-it.se" <hans.meijer@avident-it.se>
Organization: Avident-IT
Date: Tuesday, October 8, 2019 at 6:06 AM
To: "Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
Subject: [EXTERNAL] Tika Python questions



I have had the pleasure of testing the Tika-python library. I am testing it out in a new application that is being developed for customers.

It has very good performance, especially for parsing XLSX and XLS files.


However, I have two questions:
The Tika-Server handles only files with a maximum byte size. I get this error:
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1186956, but 1000000 is the maximum for this record type.

increasing the maximum allowable size for this record type.

As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

I have tried the Tika-App python (jar file) and it does handle the file size where files are larger than 1000000.

In the Tika documentation it says to set MaxBytes to -1 to override and handle larger files.

Is there any way to handle this via Tika-Python, that is, to set the maximum file size to unlimited the way “Tika-App” handles it?

How is it possible to catch errors via the Tika-python library, for example when files are encrypted, corrupt, etc.?


Kind regards



