tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject FW: [DISCUSS] A more modular parser project
Date Tue, 04 Aug 2015 13:35:14 GMT
Bob,
  Thank you, again.  This looks promising!

To continue down the strawman path and to start discussion on the elephant in the room...

We'd want bundles that allow enough control for users but aren't too much of a hassle to configure. There will be trade-offs.

So, what do we think of this strawman for proposed bundles:

tika-classic-parser-bundle/
	tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has a dependency on html/text)
	tika-pdf-parser-bundle/
	tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml, feed, iptc, crypto, etc.?)
	tika-sourcecode-parser-bundle (parsers that handle source code)
	tika-package-parser-bundle (all zip/tar/etc.)

tika-multimedia-parser-bundle/ (parsers that pull metadata out of image, audio, audio+video files)
	tika-image-parser-bundle
	tika-image-ocr-parser-bundle
	tika-audio-parser-bundle
	tika-video-parser-bundle

tika-scientific-parser-bundle/ (all parsers that handle scientific data sets: grib, isatab, gdal, hdf, netcdf, geoinfo, dif...much hand-waving...input, Chris?)

tika-nativelib-parser-bundle/ (sqlite...any others at the moment? All parsers that rely on native libs...unfortunately, this doesn't fit well thematically...)

tika-advanced-bundle/ (all parsers that rely on nlp or other advanced techniques for extraction of information. These aren't really just pulling text and metadata out, but are operating on the text/metadata once it has been pulled out. We may need separate bundles for each?)
	tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?), etc. ...or maybe we want separate bundles for each?)
	tika-sentiment-parser-bundle (imaginary...?)
	tika-object-parser-bundle
	
Where to put these?
	font parser
	executable
	mat
	prt
	strings


Cheers,
 
               Tim



-----Original Message-----
From: Bob Paulin [mailto:bob@bobpaulin.com] 
Sent: Tuesday, August 04, 2015 8:56 AM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] A more modular parser project

So I just tried adding a META-INF/services/org.apache.tika.parser.Parser 
file to each bundle in the straw man implementation and it seemed to do 
the trick. Looks like the ServiceLoader code searches the classloader 
for every copy of that file, one per jar, and adds the entries from each 
to the list of parsers.  I've updated the code on GitHub to include one 
file per bundle.  This might be the way to go.

ex.
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services
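
To illustrate the mechanism described above, here is a minimal sketch of how per-jar service files can be aggregated from a classloader. It is not Tika's ServiceLoader code. The service name demo.Parser, the provider names demo.ImageParser and demo.PdfParser, and the class ServiceFileDemo are all hypothetical, and temporary directories stand in for the bundle jars. The only JDK behaviour relied on is that ClassLoader.getResources returns every resource with a given name, one per classpath entry.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ServiceFileDemo {
    // Hypothetical service file; Tika's would be
    // META-INF/services/org.apache.tika.parser.Parser.
    static final String SERVICE = "META-INF/services/demo.Parser";

    // Collect provider class names from every copy of the service file
    // that the given classloader can see.
    static List<String> discover(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        try {
            for (URL url : Collections.list(loader.getResources(SERVICE))) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        int hash = line.indexOf('#');   // strip trailing comments
                        if (hash >= 0) {
                            line = line.substring(0, hash);
                        }
                        line = line.trim();
                        if (!line.isEmpty()) {
                            names.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Stand-in for one bundle: a temp directory holding its own service file.
    static URL fakeBundle(String providerClassName) {
        try {
            Path root = Files.createTempDirectory("bundle");
            Path file = root.resolve(SERVICE);
            Files.createDirectories(file.getParent());
            Files.write(file, (providerClassName + "\n").getBytes(StandardCharsets.UTF_8));
            return root.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build two fake bundles and list the providers found across both.
    static List<String> demo() {
        URL[] bundles = { fakeBundle("demo.ImageParser"), fakeBundle("demo.PdfParser") };
        try (URLClassLoader loader = new URLClassLoader(bundles, null)) {
            return discover(loader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each fake bundle contributes its own service file and the loader reports both, which is the behaviour that lets every bundle ship its own parser list. The sketch only reads the names; java.util.ServiceLoader would go on to instantiate the listed classes.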


- Bob

On 8/3/2015 9:21 PM, Allison, Timothy B. wrote:
>>> +1 to moving the source to bundles.  I think for a 2.0 it would be easier
>>> to consolidate into a parser uber jar than trying to tease things out
>>> like I did in the straw man impl. However, deciding how to break things
>>> up might take some experimentation.
>
> Y, and the strawman is a great easy entry down this path towards 2.0.  I think the main
> hangup will be coming to consensus about granularity and nature of the packages, but we can
> burn that bridge when we get to it.  There are some dependencies between parsers, but we can
> work through that.
>
>>> 1) To spin up the GUI you need org.apache.tika.parser.util (perhaps
>>> consider moving this up to core).
> Y, I put that in tika-parsers because it relies on commons codec, and I wanted to keep
> that dependency out of tika-core.  But, I'm willing to add it to tika-core if there aren't
> objections.
>
>
>>> 2) Since the META-INF/services/org.apache.tika.parser.Parser is in
>>> tika-parser we'd need to rethink the static ServiceLoader strategy to
>>> either always be dynamic or figure out a way to have each jar bring
>>> their own static loader.
>
> Hmmm...is there a way to specify this in one overall tika-config file or in separate
> configs in each bundle (yuck)...
>
