tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: [EXTERNAL] Tika Python questions
Date Wed, 09 Oct 2019 12:04:06 GMT
Yep, that's why we added those limits.

Hans, if you can send the full stack trace, I can see which record type
is hitting this limit, and we may be able to increase it in POI before
the next release.

On Tue, Oct 8, 2019 at 2:10 PM Luís Filipe Nassif <lfcnassif@gmail.com> wrote:
>
> I think it is related not to file size but to the maximum record size
> handled by POI. It is a protection against OutOfMemoryErrors. I increased
> this limit to 10M because I was seeing many of them. I do not know if it is
> configurable in Tika Server.
>
> Regards,
> Luis
>
> On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <mattmann@apache.org> wrote:
>
> > Hi,
> >
> > Thanks for your question. Yes, the same way you set the byte size property
> > in Tika-App (I think through parser configuration) is how you would do it
> > for Tika-Server. You would just start the Tika Server yourself with a
> > custom config file that sets this property, on the default port (making
> > sure any other instances are killed first). Tika-Python will then use
> > your own Tika Server with the custom config.
> >
> > As for catching errors, Tika-Python will try its best to do that, but it
> > does not catch all of them. If you find something it doesn’t catch, let us
> > know and we will work to fix it.
> >
> > Thanks,
> > Chris
> >
> > From: "hans.meijer@avident-it.se" <hans.meijer@avident-it.se>
> > Organization: Avident-IT
> > Date: Tuesday, October 8, 2019 at 6:06 AM
> > To: "Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
> > Subject: [EXTERNAL] Tika Python questions
> >
> > Hi
> >
> > I have had the pleasure of testing the Tika-python library. I am testing
> > it out in a new application that is being developed for customers.
> >
> > It has very good performance, especially for parsing XLSX and XLS files.
> >
> > However, I have two questions:
> >
> > 1. The Tika-Server handles only files up to a maximum byte size. I get
> > this error:
> >
> > org.apache.poi.util.RecordFormatException: Tried to allocate an array of
> > length 1186956, but 1000000 is the maximum for this record type.
> >
> > increasing the maximum allowable size for this record type.
> >
> > As a temporary workaround, consider setting a higher override value with
> > IOUtils.setByteArrayMaxOverride()
> >
> > I have tried the Tika-App python (jar file), and it does handle files
> > larger than 1000000.
> >
> > In the Tika documentation it says to set MaxBytes to -1 to override and
> > handle larger files.
> >
> > Is there any way to handle this via Tika-Python, that is, to set the
> > maximum file size to unlimited, as Tika-App does?
> >
> > 2. How is it possible to catch errors via the Tika-python library, for
> > example when files are encrypted, corrupt, etc.?
> >
> > Kind regards
> >
> > HANS MEIJER

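The setup Chris describes in the thread above (start your own Tika Server
with a custom config, then let Tika-Python talk to it) could be sketched in
Python roughly as follows. This is an illustration under stated assumptions,
not code from the Tika-Python maintainers. Assumptions: the jar and config
file names are placeholders; --config and --port are taken to be the
tika-server 1.x command-line options; TIKA_CLIENT_ONLY and
TIKA_SERVER_ENDPOINT are the environment variables described in the
Tika-Python README; the "status" key on the parse result and the meaning
given to each HTTP code (422 for encrypted or otherwise unparsable documents,
per the Tika Server documentation) should be checked against the versions in
use. Whether the POI record limit can be raised through the config file is
left open in the thread, so the sketch does not attempt it.

```python
import os
import subprocess

DEFAULT_ENDPOINT = "http://localhost:9998"  # 9998 is the Tika Server default port


def server_command(jar_path, config_path, port=9998):
    """Build the command line for starting Tika Server with a custom config.

    Assumes the tika-server 1.x options --config and --port.
    """
    return ["java", "-jar", jar_path, "--config", config_path, "--port", str(port)]


def start_server(jar_path, config_path, port=9998):
    """Launch Tika Server in the background (stop any other instance first)."""
    return subprocess.Popen(server_command(jar_path, config_path, port))


def describe_status(status):
    """Map the HTTP status of a Tika Server reply to a coarse outcome."""
    if status == 200:
        return "ok"
    if status == 204:
        return "no content"
    if status == 415:
        return "unknown file type"
    if status == 422:
        return "unparsable (for example encrypted or unsupported version)"
    return "server error or unexpected status"


def parse_with_own_server(path, endpoint=DEFAULT_ENDPOINT):
    """Parse a file through an already running Tika Server via Tika-Python."""
    # Must be set before tika is first imported, so that the library acts
    # as a REST client only and does not start its own server.
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint
    from tika import parser  # third-party package: pip install tika

    try:
        parsed = parser.from_file(path)
    except Exception as exc:  # e.g. connection refused or an unreadable reply
        return "client error: {}".format(exc), None
    # "status" is assumed to carry the HTTP status code of the server reply.
    return describe_status(parsed.get("status")), parsed.get("content")
```

A call would then look like start_server("tika-server.jar", "tika-config.xml")
followed by outcome, text = parse_with_own_server("report.xlsx"); all three
file names are hypothetical. Checking the outcome string before using the
text is one way to notice encrypted or corrupt files that do not raise an
exception on the client side.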