tika-dev mailing list archives

From Nick Burch <apa...@gagravarr.org>
Subject Re: Can some of tika-parsers module dependencies be made optional ?
Date Thu, 19 Jun 2014 19:22:08 GMT
On Thu, 19 Jun 2014, Ray Gauss wrote:
> The point of a tika-parsers-all artifact would be a single dependency 
> that re-aggregates everything so that downstream projects could work the 
> same way they do now and not worry about missing dependencies.
> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?

We already have users confused by the current split between tika-core and 
tika-parsers - see the users list for examples. We already have users confused 
by which dependencies they need with the current pom setup. Splitting is 
going to make that a lot worse. (POI, as a related example, sees plenty of 
confused users with mismatched jars and the problems that follow.)

We have previously tried pushing parsers out of the tika parser jar and 
into other jars, eg ones maintained by external groups, but on the whole 
it hasn't been a great success. Keeping them in sync, dealing with 
different cycles, applying updates, keeping them consistent, building in a 
sensible length of time, all of that would be harder with a pile of 

If we were to split out to the level needed by some of the use cases 
mentioned, we'd have so many parser modules that it'd be a nightmare to 
maintain, and it would cause the problems mentioned above. (People in other 
threads have cautioned about these problems.) If we split into just a handful 
of sub-modules, then many of the use cases mentioned still have to do 
work to pick out the bits they need.

I still believe that the main use case of Tika is "everything included", 
and that's especially the beginner's use case, so I think we should focus 
on keeping that easy. Peeling out just some bits feels like an advanced 
use case to me, so I'd rather we put the requirement for effort onto those 
folks, rather than onto newbies and people with typical uses. I'd 
therefore much rather we provide advanced docs/help on excluding some 
bits, rather than pull things out into a pile of different modules.
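
For the advanced case, a rough sketch of what such docs might show is a 
Maven exclusion on the tika-parsers dependency. The version number and the 
choice of PDFBox as the excluded library are assumptions for illustration 
only; check the tika-parsers pom for the dependencies you actually want to 
drop.

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <!-- assumed version, adjust to the release you use -->
  <version>1.5</version>
  <exclusions>
    <!-- illustrative: leave out the PDF library if you never parse PDFs -->
    <exclusion>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Parsers that rely on an excluded library will not work, so this only suits 
projects that are sure they never need that format.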
