tika-dev mailing list archives

From Luís Filipe Nassif <lfcnas...@gmail.com>
Subject Re: [EXTERNAL] Tika Python questions
Date Tue, 08 Oct 2019 21:10:02 GMT
I think it is not related to file size, but to the maximum record size
handled by POI. The limit is a protection against OutOfMemoryErrors. I
increased it to 10M because I was seeing many of these errors. I do not know
if it is configurable in tika server.

Regards,
Luis
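
To illustrate the point above: the two numbers in the POI message describe a single record, not the whole file. The sketch below (Python, since the question is about Tika-Python) pulls them out of the error text quoted further down in this thread. The names `parse_record_limit_error` and `RecordLimitError` are made up for this example and are not part of Tika or POI.

```python
import re
from typing import NamedTuple, Optional

# Pattern based on the error text quoted in this thread. \s+ is used between
# words because the message is often wrapped across lines in logs.
_POI_LIMIT = re.compile(
    r"Tried\s+to\s+allocate\s+an\s+array\s+of\s+length\s+(\d+),"
    r"\s+but\s+(\d+)\s+is\s+the\s+maximum"
)


class RecordLimitError(NamedTuple):
    requested: int  # size POI tried to allocate for one record
    limit: int      # current maximum for that record type


def parse_record_limit_error(message: str) -> Optional[RecordLimitError]:
    """Return the sizes from a POI record-limit message, or None for other errors."""
    match = _POI_LIMIT.search(message)
    if match is None:
        return None
    return RecordLimitError(int(match.group(1)), int(match.group(2)))


error_text = (
    "org.apache.poi.util.RecordFormatException: Tried to allocate an array of\n"
    "length 1186956, but 1000000 is the maximum for this record type."
)
info = parse_record_limit_error(error_text)
print(info)
```

A check like this only tells you that the record limit was hit; raising the limit still has to happen on the Java side, for example through `IOUtils.setByteArrayMaxOverride()` as the message suggests.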

On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <mattmann@apache.org>
wrote:

> Hi,
>
>
>
> Thanks for your question. Yes, the same way you set the byte size property
> in Tika-App (I think through parser configuration) is how you would do it
> for Tika-Server. You would start the Tika Server yourself with a custom
> config file that sets this property, on the default port (making sure any
> other instances are killed first). Tika-Python will then use your own Tika Server
> with custom config.
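
A minimal sketch of that setup from the Python side, under stated assumptions: the jar and config file names are placeholders, port 9998 is the Tika Server default, and what goes inside the config file is not shown here because the thread does not settle which property applies. `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` are environment variables read by tika-python; the helper function names are hypothetical.

```python
import os
from typing import List

# Assumed for illustration: default Tika Server port, placeholder file names.
SERVER_URL = "http://localhost:9998"


def server_command(jar: str = "tika-server.jar",
                   config: str = "tika-config.xml") -> List[str]:
    """Build the command that starts Tika Server with a custom config file."""
    return ["java", "-jar", jar, "--config", config]


def configure_client(server_url: str = SERVER_URL) -> None:
    """Make tika-python act as a REST client only, without spawning a server."""
    # tika-python reads these when it is first imported, so set them before
    # any `import tika` in the process.
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = server_url


def parse_with_own_server(path: str, server_url: str = SERVER_URL) -> dict:
    """Parse a file through the server you started yourself."""
    configure_client(server_url)
    from tika import parser  # lazy import; requires `pip install tika`
    return parser.from_file(path, serverEndpoint=server_url)


# Print the start command; run it in a separate terminal before parsing.
print(" ".join(server_command()))
```

Once the server is up, `parse_with_own_server("report.xlsx")` would send the file to it instead of letting tika-python download and launch its own jar.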
>
>
>
> As for catching errors, Tika-Python will try its best to do that, but it
> does not catch all of them. If you find
> something it doesn’t catch let us know and we will work to fix it.
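
One way to handle the cases the library does not catch is to wrap the call yourself. The sketch below assumes the dict returned by `parser.from_file` carries a `status` key with the HTTP status from Tika Server (recent tika-python releases do this), and uses the status codes listed in the Tika Server REST documentation. `safe_parse`, `describe_status` and `fake_encrypted` are hypothetical names; the fake parser stands in for a real server so the example runs offline.

```python
from typing import Callable, Optional

# Status codes as listed in the Tika Server REST documentation.
STATUS_MEANING = {
    200: "ok",
    204: "no content",
    422: "unprocessable (unsupported type, encrypted document, etc.)",
    500: "error while processing the document",
}


def describe_status(status: Optional[int]) -> str:
    """Translate a server status code into a short description."""
    return STATUS_MEANING.get(status, f"unexpected status {status}")


def safe_parse(path: str,
               parse: Optional[Callable[[str], dict]] = None) -> dict:
    """Parse `path`; always return a dict with 'ok', 'reason' and 'content'."""
    if parse is None:
        from tika import parser  # lazy import; requires `pip install tika`
        parse = parser.from_file
    try:
        result = parse(path)
    except Exception as exc:  # e.g. server unreachable, unreadable response
        return {"ok": False, "reason": f"client error: {exc}", "content": None}
    status = result.get("status")
    return {
        "ok": status == 200,
        "reason": describe_status(status),
        "content": result.get("content"),
    }


# Stand-in for the server's answer to an encrypted file (assumption: 422).
def fake_encrypted(path: str) -> dict:
    return {"status": 422, "content": None, "metadata": None}


outcome = safe_parse("secret.xlsx", parse=fake_encrypted)
print(outcome["ok"], outcome["reason"])
```

Passing the parser in as an argument keeps the error handling testable without a running server; in real use you would call `safe_parse(path)` and let it fall back to `parser.from_file`.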
>
>
>
> Thanks,
>
> Chris
>
> From: "hans.meijer@avident-it.se" <hans.meijer@avident-it.se>
> Organization: Avident-IT
> Date: Tuesday, October 8, 2019 at 6:06 AM
> To: "Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
> Subject: [EXTERNAL] Tika Python questions
>
>
>
> Hi
>
> I have had the pleasure of testing the Tika-python library. I am testing
> it out in a new application that is being developed for customers.
>
> It has very good performance, especially for parsing XLSX and XLS files.
>
>
>
> However, I have two questions:
> The Tika-Server only handles files up to a maximum byte size. I get this
> error:
> org.apache.poi.util.RecordFormatException: Tried to allocate an array of
> length 1186956, but 1000000 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>
> I have tried the Tika-App (jar file) from Python, and it does handle files
> whose size is larger than 1000000.
>
> In the Tika documentation it says to set MaxBytes to -1 to override and
> handle larger files.
>
> Is there any way to handle this via Tika-Python? To set the maximum file
> size to unlimited, as the “Tika-App” handles it?
>
>
> How is it possible to catch errors via the Tika-python library, like if
> files are encrypted, corrupt etc.?
>
>
>
>
> Kind regards
>
>
>
> HANS MEIJER
