tika-dev mailing list archives

From "Tim Allison (Jira)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2972) Allow users to specify a ContentHandlerFactory in tika-config.xml
Date Tue, 22 Oct 2019 20:36:00 GMT
Tim Allison created TIKA-2972:

             Summary: Allow users to specify a ContentHandlerFactory in tika-config.xml
                 Key: TIKA-2972
                 URL: https://issues.apache.org/jira/browse/TIKA-2972
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison

I'd like to add a tika-eval handler that calculates text stats at the end of parsing a
document, so that the user can get a unified, simpler view of the number of tokens, out-of-vocabulary
tokens, etc. in the metadata rather than having to run their own post-parse process on the content.

The problem comes with integrating this into tika-app and tika-server -- tika-app balloons
to 134MB.  I don't want to nearly double the size of tika-app just so that I can add some
stuff that very few folks will use.

I think we've discussed this option before, but it would be handy to allow users to specify
a ContentHandlerFactory or possibly a map of ContentHandlerFactories in tika-config.xml so
that users can get custom handling in tika-app and tika-server.

The idea behind a map of ContentHandlerFactories would be to give each content handler
factory a name, so that a user could call different handlers on tika-server like this:

`curl... http://localhost:9998/tika/custom/myhandler1`
`curl... http://localhost:9998/tika/custom/myhandler2`

or in tika-app:

`java -jar tika-app.jar --handlerFactory=myhandler1...`
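
As a rough sketch of the name-to-factory map idea (not an existing Tika API): `HandlerFactory` below is a hypothetical stand-in for Tika's ContentHandlerFactory so that the example compiles with the JDK alone, and `HandlerFactoryRegistry` and `CharCountHandler` are likewise made up for illustration. How the map would actually be populated from tika-config.xml is left open here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for Tika's ContentHandlerFactory (assumption, JDK-only).
interface HandlerFactory {
    ContentHandler getNewContentHandler();
}

// Hypothetical handler that counts the characters of text it receives.
class CharCountHandler extends DefaultHandler {
    private long count = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        count += length;
    }

    long getCount() {
        return count;
    }
}

// Hypothetical registry: maps a user-chosen name (e.g. "myhandler1")
// to the factory that should build the handler for that request.
class HandlerFactoryRegistry {
    private final Map<String, HandlerFactory> factories = new LinkedHashMap<>();

    void register(String name, HandlerFactory factory) {
        factories.put(name, factory);
    }

    // Returns a fresh handler per call; unknown names fail loudly.
    ContentHandler newHandler(String name) {
        HandlerFactory factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No handler factory named: " + name);
        }
        return factory.getNewContentHandler();
    }
}
```

With something like this, tika-server or tika-app would only need to look up the name taken from the URL path or the command-line option and ask the matching factory for a new handler, e.g. `registry.newHandler("myhandler1")`.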



This message was sent by Atlassian Jira