tika-dev mailing list archives

From Bob Paulin <...@apache.org>
Subject Tika 2.0 Modules first pass.
Date Wed, 06 Jan 2016 03:54:08 GMT

I took a stab at the initial module structure based on the email thread 
between Tim and me [1].  If a package didn't seem to fit with anything 
else, I created an individual project for it.  If any of the groupings 
don't make sense, or folks think there are better ways to organize, I'm 
happy to move stuff around.  Patches are welcome :).  I have created a 
JIRA issue [2].  Committed with rev 1723223.

There's still a good amount of outstanding work:
1) All of this could use more testing, especially the external parsers.
2) As Tim has already raised, there is the issue of maintaining two 
branches.  There are likely some fixes in trunk that are not currently 
applied to the 2.0 branch.
3) The tika-parser project is currently using the maven shade plugin and 
that is causing issues creating the OSGi Manifest.MF file.  I should be 
able to find a way around this.
4) Still need to recreate the OSGi uber jar with all dependencies 
packaged with the tika code.
5) There are still some classes in the tika-parser project.  Should 
these all be moved to core? A common project?...
6) Documentation.  I could use some Wiki access.  Username: BobPaulin.
7) There are some dependencies in the tika-parser project that were not 
needed to compile any of the individual modules or run tests. Are they 
still needed?
8) Where does the 
org.apache.tika.parser.external.CompositeExternalParser ServiceLoader 
(META-INF/services/org.apache.tika.parser.Parser) config belong?  I 
moved it to tika-core since that is where the class lives.
9) Subcomponent licenses.  I moved them to the modules they belong in 
but I need to figure out a way to make them bubble up to the uber jars.  
Or perhaps they need to be dual maintained.
10) Anything I may be forgetting....;)
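
For readers less familiar with the ServiceLoader config mentioned in item 8, here is a minimal sketch of how a META-INF/services file drives discovery. It does not use Tika itself: DemoParser and DemoExternalParser are hypothetical stand-ins for org.apache.tika.parser.Parser and CompositeExternalParser, and the services file is written to a temporary directory only so the example is self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ServiceConfigDemo {

    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    public interface DemoParser {
        String name();
    }

    /** Hypothetical stand-in for CompositeExternalParser. */
    public static class DemoExternalParser implements DemoParser {
        public DemoExternalParser() { }

        @Override
        public String name() {
            return "demo-external";
        }
    }

    /** Writes a services file, then lets ServiceLoader discover the provider. */
    static List<String> discover() {
        try {
            // A throwaway classpath root that contains only the services file.
            Path root = Files.createTempDirectory("services-demo");
            Path services = root.resolve("META-INF").resolve("services");
            Files.createDirectories(services);

            // File name = service interface name; content = provider class name.
            Files.write(services.resolve(DemoParser.class.getName()),
                    DemoExternalParser.class.getName().getBytes(StandardCharsets.UTF_8));

            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] { root.toUri().toURL() },
                    ServiceConfigDemo.class.getClassLoader())) {
                for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
                    names.add(parser.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

The point relevant to item 8 is that the services file only has to be visible on the class path alongside the provider class. That is one argument for keeping the config in the same module as the class it names.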

For the most part, the changes just reorganize the existing packages.  
There are a handful of changes to the test suite in order to break some 
cyclical dependencies.  Here's an overview of how the projects 
interrelate at the moment:

  - /tika-advanced-module
  - /tika-cad-module
            -> tika-text-module [test]
  - /tika-code-module
            -> tika-text-module [test]
  - /tika-database-module
            -> tika-office-module [test]
  - /tika-ebook-module
            -> tika-text-module
  - /tika-journal-module
            -> tika-pdf-module
  - /tika-multimedia-module
            -> tika-web-module [test]
            -> tika-office-module [test]
            -> tika-pdf-module [test]
  - /tika-office-module
            -> tika-web-module [test]
            -> tika-package-module [test]
            -> tika-text-module [test]
  - /tika-package-module
  - /tika-pdf-module
            -> tika-text-module [test]
            -> tika-package-module [test]
            -> tika-office-module [test]
  - /tika-scientific-module
            -> tika-text-module [test]
  - /tika-text-module
            -> tika-text-module [test]
            -> tika-package-module [test]
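
As a rough way to check an overview like this for remaining cycles, the sketch below runs a depth-first search over the edges exactly as listed above. It is an illustration under stated assumptions, not part of the build: the class and method names are hypothetical, and compile and [test] scopes are merged into one graph. On the listing as written, the only cycle it reports is the tika-text-module entry that points at itself; this email does not say whether that entry is intentional.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ModuleGraphCheck {

    /** Edges copied from the overview; compile and [test] scopes are merged. */
    static Map<String, List<String>> overviewGraph() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("tika-advanced-module", Collections.emptyList());
        g.put("tika-cad-module", Arrays.asList("tika-text-module"));
        g.put("tika-code-module", Arrays.asList("tika-text-module"));
        g.put("tika-database-module", Arrays.asList("tika-office-module"));
        g.put("tika-ebook-module", Arrays.asList("tika-text-module"));
        g.put("tika-journal-module", Arrays.asList("tika-pdf-module"));
        g.put("tika-multimedia-module", Arrays.asList(
                "tika-web-module", "tika-office-module", "tika-pdf-module"));
        g.put("tika-office-module", Arrays.asList(
                "tika-web-module", "tika-package-module", "tika-text-module"));
        g.put("tika-package-module", Collections.emptyList());
        g.put("tika-pdf-module", Arrays.asList(
                "tika-text-module", "tika-package-module", "tika-office-module"));
        g.put("tika-scientific-module", Arrays.asList("tika-text-module"));
        g.put("tika-text-module", Arrays.asList(
                "tika-text-module", "tika-package-module"));
        return g;
    }

    /** Returns the first cycle found (start module repeated at the end), or an empty list. */
    static List<String> findCycle(Map<String, List<String>> graph) {
        Set<String> done = new HashSet<>();
        for (String start : graph.keySet()) {
            List<String> cycle = visit(start, graph, done, new ArrayList<>());
            if (!cycle.isEmpty()) {
                return cycle;
            }
        }
        return Collections.emptyList();
    }

    private static List<String> visit(String node, Map<String, List<String>> graph,
                                      Set<String> done, List<String> path) {
        int seenAt = path.indexOf(node);
        if (seenAt >= 0) {
            // The node is already on the current path, so the path closes a cycle.
            List<String> cycle = new ArrayList<>(path.subList(seenAt, path.size()));
            cycle.add(node);
            return cycle;
        }
        if (done.contains(node)) {
            return Collections.emptyList();
        }
        path.add(node);
        // Modules that only appear as dependencies have no outgoing edges here.
        for (String dep : graph.getOrDefault(node, Collections.emptyList())) {
            List<String> cycle = visit(dep, graph, done, path);
            if (!cycle.isEmpty()) {
                return cycle;
            }
        }
        path.remove(path.size() - 1);
        done.add(node);
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(findCycle(overviewGraph()));
    }
}
```

Merging the scopes is a deliberately strict choice: a cycle that exists only through [test] edges is still reported.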

I'm very interested in feedback.  We have been talking about this for a 
while, but I'm sure actually seeing it will create more discussion.  
Seeing how much simpler the individual pom files are does seem to 
demonstrate that this will be a good thing for the project.


- Bob

[2] https://issues.apache.org/jira/browse/TIKA-1824
