tika-dev mailing list archives

From David Meikle <loo...@gmail.com>
Subject Re: TIKA-1509 (2.x breaking parser change) - ready for first review!
Date Sun, 18 Mar 2018 21:47:05 GMT
Nice one Nick!  Will take a look this week.


On 14 March 2018 at 17:38, Nick Burch <nick@apache.org> wrote:

> Hi All
> As promised, I've finally had a go at implementing my ideas for
> TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
> the breaking 2.x parser change.
> My work so far is in this github branch, and is ready for review!
> https://github.com/apache/tika/tree/multiple-parsers
> It seems to work fine for the Fallback case, and for the Supplemental
> case. You can set a policy that controls how clashing metadata is handled.
> The options are currently "first one to set a key wins", "last one to set
> a key wins", "ignore previous parsers", and "keep old and new unique values".
> I've also done a proof of concept for the "pick best" case: it runs the
> text parser with a specified set of different charsets, captures the text
> from each, "picks the best" (currently hard-coded to the first), then runs
> for real with that one.
> Key TODOs:
> - Support InputStreamFactory
> - Properly work out what mimetypes to claim to support
> - Tika Config XML friendly helper for the metadata clash policy
> - Review ContentHandlerFactory signature and tweak if needed
> Proposed breaking 2.x change: add a second parse method that takes a
> ContentHandlerFactory instead of a ContentHandler. Most parsers would
> just grab a single handler from the factory and use that as before.
> Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
> I stop? Carry on? Modify it? Other?
> Nick
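
For anyone following the thread, the four metadata clash policies quoted above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the multiple-parsers branch: the class name, the enum names and the merge() helper are all hypothetical, metadata is modelled as a plain Map of key to list of values rather than Tika's Metadata class, and "ignore previous parsers" is assumed to mean that whatever earlier parsers set is thrown away.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; names are illustrative, not taken from the branch. */
final class MetadataClashSketch {

    /** One entry per policy named in the message; the enum names are made up. */
    enum Policy { FIRST_WINS, LAST_WINS, DISCARD_ALL, KEEP_ALL }

    /** Combines metadata left by earlier parsers with metadata from the latest parser. */
    static Map<String, List<String>> merge(Policy policy,
                                           Map<String, List<String>> previous,
                                           Map<String, List<String>> latest) {
        if (policy == Policy.DISCARD_ALL) {
            // Assumed meaning of "ignore previous parsers": keep only the latest output.
            return copy(latest);
        }
        Map<String, List<String>> result = copy(previous);
        for (Map.Entry<String, List<String>> entry : latest.entrySet()) {
            String key = entry.getKey();
            List<String> existing = result.get(key);
            if (existing == null) {
                // No clash: the key is new, so every remaining policy just adds it.
                result.put(key, new ArrayList<>(entry.getValue()));
                continue;
            }
            switch (policy) {
                case FIRST_WINS:
                    // The first parser to set the key wins; keep the old values.
                    break;
                case LAST_WINS:
                    // The last parser to set the key wins; replace the old values.
                    result.put(key, new ArrayList<>(entry.getValue()));
                    break;
                case KEEP_ALL:
                    // Keep old values and append only new values not already present.
                    for (String value : entry.getValue()) {
                        if (!existing.contains(value)) {
                            existing.add(value);
                        }
                    }
                    break;
                default:
                    throw new IllegalStateException("Unhandled policy: " + policy);
            }
        }
        return result;
    }

    /** Deep-copies the map so that merging never mutates the caller's data. */
    private static Map<String, List<String>> copy(Map<String, List<String>> source) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : source.entrySet()) {
            out.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        return out;
    }
}
```

With this sketch, if an earlier parser set title to "A" and a later one set it to "B", FIRST_WINS keeps "A", LAST_WINS keeps "B", KEEP_ALL keeps both in order, and DISCARD_ALL keeps "B" while dropping any other keys the earlier parser set.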
