tika-dev mailing list archives

From Bob Paulin <...@bobpaulin.com>
Subject Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils
Date Mon, 28 Mar 2016 01:02:02 GMT
Hi Nick,

On 3/27/2016 6:52 PM, Nick Burch wrote:
> On Sun, 27 Mar 2016, Bob Paulin wrote:
>> Currently the Apache POI dependency is in several modules and it's 
>> sort of a beast (> 2 MB in size).
> You should've seen it before Jukka and Yegor spent a crazy ApacheCon 
> hacking up the ooxml-lite support... ;-)
I can only imagine.
>> It appears many of the modules are only using the IOUtils library.
> I suspect a strong overlap with the parser classes I've helped write...
>> Any concerns with replacing this POI stuff with commons-io? Does POI 
>> offer anything above the commons-io functionality in IOUtils? If not 
>> I think it would be great to isolate the poi dependency to the office 
>> module only.
> A lot of the use is for endian-specific reading of numbers and 
> strings. Might be a bit of stream stuff, but mostly that can be passed 
> off to the Tika IO utils classes.
Didn't even think of looking at Tika IO, but yes, that would be even better.
> From a quick check, I can't see any endian number stuff in commons 
> IO, but I might have missed it, or it might be in a different commons 
> module. If not, there might be something to be said for popping that 
> POI logic along with some of the Ogg-Vorbis utils stuff (another one 
> with my grubby mitts all over it) into a more helpful general utils 
> grouping.
Yes, I think overall we're in a better place if these functions can live 
either inside Tika or in a smaller dependent library. I'll take a look 
at Ogg-Vorbis.

> Nick
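
For reference, the "endian-specific reading of numbers" discussed above amounts to something like the following. This is only a minimal sketch using the JDK alone: the class name LittleEndianSketch and its method names are hypothetical, chosen for illustration, and are not the POI, Tika, or commons-io API.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not an existing POI or Tika class.
final class LittleEndianSketch {
    private LittleEndianSketch() {}

    // Reads a single byte as an unsigned value (0..255), failing on end of stream.
    private static int readByte(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            throw new EOFException("stream ended in the middle of a value");
        }
        return b;
    }

    // Reads a 2-byte unsigned integer stored least-significant byte first.
    static int readUShortLE(InputStream in) throws IOException {
        int lo = readByte(in);
        int hi = readByte(in);
        return (hi << 8) | lo;
    }

    // Reads a 4-byte signed integer stored least-significant byte first.
    static int readIntLE(InputStream in) throws IOException {
        int b0 = readByte(in);
        int b1 = readByte(in);
        int b2 = readByte(in);
        int b3 = readByte(in);
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }
}
```

Helpers of this size have no dependency beyond java.io, which is the argument for keeping them in a small shared utils grouping rather than pulling in all of POI for them.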
