tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: [DISCUSS] A more modular parser project
Date Mon, 24 Aug 2015 14:14:01 GMT

Apologies for my delay.  If anyone else wants to chime in on potential subgroupings of parsers,
that'd be great!

>>So just to understand the break downs.  When you say:

	tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has dependency on html/text)
	tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml, feed, iptc, crypto, etc?)
	tika-sourcecode-parser-bundle (parsers that handle source code)
	tika-package-parser-bundle (all zip/tar/etc)

>>Does that indicate 6 bundles?  5 individuals that could wrap into 1 uber jar?  

Y, this is pure strawman, without knowledge of OSGi conventions/best practices.  So, guidance
on best practices is very much needed!  My thought, which may be crazy in an OSGi environment,
was to have 6 bundles that consumers could use: the overall tika-classic or any combination
of the 5 child bundles.  For the most basic use case, y, there'd be one uber-jar for tika-classic.

>>Breaking things down at different levels will add to maintenance effort so it may be better to start with the broad strokes like tika-classic-parser-bundle.

Yes, makes sense.  I broke out the pdf parser, for example, only because I heard that there
are some users who really only want that.  Any division/subdivision of this sort will be based
on subjective criteria, I think.  More feedback from the community will be important on these

>>I think this approach is fine but it does mean we're taking an opinion on what most of Tika's basic users want for simple usecases.

Y. I agree.  My simple use case will be different from someone else's.  

>>Another approach could be grouping the parsers by similar dependencies which I think the tika-multimedia-parser-bundle does fairly well.  From a dependence management perspective this is desirable.

I really like this because it is defensible and predictable, and y, from a dependency management
perspective, this is great.  However, from a use case perspective, I think users will care
much more about file types handled than jars required.  I could very well be wrong though.
 I have to look more carefully at the dependency report link that you sent...I suspect that
there may be some clear overlap between file formats and dependencies that may be enough to
drive this type of configuration (logging and commons-x aside :) ).

>>With respect to bundles that don't fit perhaps those live on their own until an obvious emerges.  It's much harder to remove something from a bundle than to add it later.  I think this may apply to native bundles too.

Oh, very good to know.  Thank you.



On 8/4/2015 8:32 AM, Allison, Timothy B. wrote:
> Bob,
>    Thank you, again.  This looks promising at first glance!
> To continue down the strawman path and to start discussion on the elephant in the room...
> We'd want bundles that allow enough control for users but aren't too much of a hassle to configure.  There will be trade-offs.
> So, what do we think of this strawman for proposed bundles:
> tika-classic-parser-bundle/
> 	tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has dependency on html/text)
> 	tika-pdf-parser-bundle/
> 	tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml, feed, iptc, crypto, etc?)
> 	tika-sourcecode-parser-bundle (parsers that handle source code)
> 	tika-package-parser-bundle (all zip/tar/etc)
> tika-multimedia-parser-bundle/ (parsers that pull metadata out of image, audio, audio+video)
> 	tika-image-parser-bundle
> 	tika-image-ocr-parser-bundle
> 	tika-audio-parser-bundle
> 	tika-video-parser-bundle
> tika-scientific-parser-bundle/ (all parsers that handle scientific data sets (grib, isatab, gdal, hdf, netcdf, geoinfo, dif...much hand-waving...input, Chris?))
> tika-nativelib-parser-bundle/ (sqlite...any others at the moment? all parsers that rely on native libs...unfortunately, this doesn't fit well thematically...)
> tika-advanced-bundle/ (all parsers that rely on nlp or other advanced techniques for extraction of information...
> 		these aren't really just pulling text and metadata out, but are operating on the text/metadata
> 		once it has been pulled out.  We may need separate bundles for each?)
> 	tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?) etc.
> 		...or maybe we want separate bundles for each?)
> 	tika-sentiment-parser-bundle (imaginary...?)
> 	tika-object-parser-bundle
> Where to put?
> 	font parser
> 	executable
> 	mat
> 	prt
> 	strings
> Cheers,
>                 Tim
> -----Original Message-----
> From: Bob Paulin [mailto:bob@bobpaulin.com]
> Sent: Tuesday, August 04, 2015 8:56 AM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] A more modular parser project
> So I just tried adding a 
> META-INF/services/org.apache.tika.parser.Parser
> file to each bundle in the straw man implementation and it seemed to 
> do the trick. Looks like the ServiceLoader code searches the 
> classloader for all of these files and iterates through them to pick 
> up each jar's META-INF/services/org.apache.tika.parser.Parser entries 
> and adds them to the list.  I've updated the code on github to include one per bundle.
> This might be the way to go.
> ex.
> https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services
> - Bob
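
For readers following along, here is a minimal, self-contained sketch of the behaviour Bob describes: ServiceLoader gathers every META-INF/services file visible to the classloader, so each bundle can ship its own. Everything named here is hypothetical. SketchParser stands in for org.apache.tika.parser.Parser, and two temporary directories stand in for bundle jars. It is not Tika's actual loading code.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ServiceFileSketch {

    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    public interface SketchParser {
        String name();
    }

    /** Hypothetical provider that would live in an "image" bundle. */
    public static class ImageSketchParser implements SketchParser {
        public String name() { return "image"; }
    }

    /** Hypothetical provider that would live in a "pdf" bundle. */
    public static class PdfSketchParser implements SketchParser {
        public String name() { return "pdf"; }
    }

    /** Builds a temp directory that mimics one jar carrying its own services file. */
    static URL fakeBundle(String label, Class<?> provider) throws IOException {
        Path root = Files.createTempDirectory(label);
        Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
        // The file is named after the service interface and lists one provider class.
        Path file = services.resolve(SketchParser.class.getName());
        Files.write(file, (provider.getName() + "\n").getBytes(StandardCharsets.UTF_8));
        return root.toUri().toURL();
    }

    /** Loads providers through a classloader that sees both fake bundles. */
    static List<String> discover() throws IOException {
        URL[] bundles = {
            fakeBundle("image-bundle", ImageSketchParser.class),
            fakeBundle("pdf-bundle", PdfSketchParser.class)
        };
        List<String> names = new ArrayList<>();
        try (URLClassLoader loader =
                 new URLClassLoader(bundles, ServiceFileSketch.class.getClassLoader())) {
            // ServiceLoader reads each services file it finds and instantiates the entries.
            for (SketchParser parser : ServiceLoader.load(SketchParser.class, loader)) {
                names.add(parser.name());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(discover());
    }
}
```

Running main should list both providers, one per fake bundle. Real bundles would simply ship the services file inside the jar; the temp-directory trick is only there to keep the sketch in a single file.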
> On 8/3/2015 9:21 PM, Allison, Timothy B. wrote:
>>>> +1 to moving the source to bundles.  I think for a 2.0 would be
>>>> easier to consolidate into a parser uber jar than trying to tease things out
>>>> like I did in the straw man impl. However deciding how to break
>>>> things up might take some experimentation.
>> Y, and the strawman is a great easy entry down this path towards 2.0.  I think the main hangup will be coming to consensus about granularity and nature of the packages, but we can burn that bridge when we get to it.  There are some dependencies between parsers, but we can work through that.
>>>> 1) To spin up the GUI you need org.apache.tika.parser.util (perhaps
>>>> consider moving this up to core).
>> Y, I put that in tika-parsers because it relies on commons codec, and I wanted to keep that dependency out of tika-core.  But, I'm willing to add it to tika-core if there aren't
>>>> 2) Since the META-INF/services/org.apache.tika.parser.Parser is in
>>>> tika-parser we'd need to rethink the static ServiceLoader strategy to
>>>> either always be dynamic or figure out a way to have each jar bring
>>>> their own static loader.
>> Hmmm...is there a way to specify this in one overall tika-config file or in separate configs in each bundle (yuck)...
