tika-dev mailing list archives

From "Tim Allison (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2972) Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml
Date Thu, 31 Oct 2019 14:05:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964058#comment-16964058 ]

Tim Allison commented on TIKA-2972:

I'm struggling somewhat with two use cases.  The first is the one I opened this issue for:
apply a custom handler factory to post-process the extracted text and add the results to the
document's metadata, with the output going roughly where it does now, e.g. a JSON file for
tika-batch or a JSON response for rmeta/.
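
A minimal sketch of what a handler for this first use case might look like, using only the
JDK's SAX classes. The class name {{TokenCountingHandler}}, the key {{stats:tokenCount}}, and
the plain {{Map}} standing in for Tika's Metadata object are all hypothetical; this is an
illustration under those assumptions, not the tika-eval implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers extracted text and records a token count
// in a metadata map once parsing has finished.
public class TokenCountingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Map<String, String> metadata; // stand-in for Tika's Metadata

    public TokenCountingHandler(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // Naive whitespace tokenization; a real implementation would use an analyzer.
        String trimmed = text.toString().trim();
        int tokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        metadata.put("stats:tokenCount", Integer.toString(tokens)); // hypothetical key
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        TokenCountingHandler handler = new TokenCountingHandler(metadata);
        handler.startDocument();
        char[] chars = "some extracted text".toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
        System.out.println(metadata); // prints {stats:tokenCount=3}
    }
}
```

The stats land in the metadata, so the normal output path (file or JSON response) would not
need to change.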

The second use case is more in line with SOLR-7632, where the custom handler itself sends the
output over the network as a Solr document on {{endDocument()}}, so there is no file
written, no OutputStream needed, and no JSON response carrying the content.
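
A rough sketch of this second pattern, again with only the JDK's SAX classes. The class name
{{ForwardingHandler}} is hypothetical, and the {{Consumer<String>}} sink is an assumption that
stands in for whatever client would actually post the document to Solr; no Solr API is used
here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers the extracted text and pushes it to a sink
// when the document ends, instead of writing to a file or OutputStream.
public class ForwardingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Consumer<String> sink; // stand-in for a network client

    public ForwardingHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void startDocument() {
        text.setLength(0); // reset so the handler can be reused per document
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        sink.accept(text.toString()); // the only place output leaves the handler
    }

    public static void main(String[] args) throws Exception {
        List<String> sent = new ArrayList<>();
        ForwardingHandler handler = new ForwardingHandler(sent::add);
        handler.startDocument();
        char[] chars = "hello world".toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
        System.out.println(sent); // prints [hello world]
    }
}
```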

I'm now wondering whether we should keep this issue as it is, covering customized handlers
whose output still goes where it normally does, and open a second issue to allow for
custom endpoints in tika-server that could, say, send the data to Solr/ES directly, rather
than returning the parsed data to the client to then resend to Solr/ES or ...

> Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml
> -------------------------------------------------------------------------------
>                 Key: TIKA-2972
>                 URL: https://issues.apache.org/jira/browse/TIKA-2972
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
> I'd like to add a tika-eval handler that will calculate text stats at the end of parsing
a document so that the user can get a unified/simpler view of the number of tokens, out-of-vocabulary
tokens, etc. in the metadata rather than having to run their own post-parse process on the content.
> The problem comes with integrating this into tika-app and tika-server -- tika-app balloons
to 134MB.  I don't want to nearly double the size of tika-app just so that I can add some
stuff that very few folks will use.
> I think we've discussed this option before, but it would be handy to allow users to specify
a ContentHandlerFactory or possibly a map of ContentHandlerFactories in tika-config.xml so
that users can get custom handling in tika-app and tika-server.
> The idea of a map of ContentHandlerFactories would be to have a name for each content
handler factory, so that a user could call different handlers on tika-server like this:
> -{{curl... http://localhost:9998/tika/custom/myhandler1}}-
> -{{curl... http://localhost:9998/tika/custom/myhandler2}}-
> That's not right because we'd want to differentiate classic Tika parsing and the RecursiveParserWrapper...
> {{curl... http://localhost:9998/tika/myhandler1}}
> {{curl... http://localhost:9998/tika/myhandler2}}
> {{curl... http://localhost:9998/rmeta/myhandler1}}
> {{curl... http://localhost:9998/rmeta/myhandler2}}
> or in tika-app:
> {{java -jar tika-app.jar --handlerFactory=myhandler1...}}
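
One way the name-to-factory lookup described above could be sketched, using only the JDK. The
class {{HandlerFactoryRegistry}}, its methods, and the use of {{Supplier<ContentHandler>}} in
place of Tika's ContentHandlerFactory are all assumptions made for illustration; how
tika-config.xml would actually populate such a map is not decided in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical registry mapping handler names to factories.
public class HandlerFactoryRegistry {
    private final Map<String, Supplier<ContentHandler>> factories = new LinkedHashMap<>();

    public void register(String name, Supplier<ContentHandler> factory) {
        factories.put(name, factory);
    }

    // Resolves a request path such as "/rmeta/myhandler1" to a fresh handler.
    public ContentHandler forPath(String path) {
        String[] parts = path.split("/"); // parts[0] is empty for a leading "/"
        if (parts.length != 3 || !parts[0].isEmpty()) {
            throw new IllegalArgumentException("Expected /<endpoint>/<handler>: " + path);
        }
        String endpoint = parts[1];
        if (!endpoint.equals("tika") && !endpoint.equals("rmeta")) {
            throw new IllegalArgumentException("Unknown endpoint: " + endpoint);
        }
        Supplier<ContentHandler> factory = factories.get(parts[2]);
        if (factory == null) {
            throw new IllegalArgumentException("No handler factory named: " + parts[2]);
        }
        return factory.get(); // one new handler per request
    }

    public static void main(String[] args) {
        HandlerFactoryRegistry registry = new HandlerFactoryRegistry();
        registry.register("myhandler1", DefaultHandler::new);
        ContentHandler handler = registry.forPath("/rmeta/myhandler1");
        System.out.println(handler != null); // prints true
    }
}
```

Keeping the endpoint ({{tika}} or {{rmeta}}) separate from the handler name mirrors the URL
layout above, where the same named handler can be used with either parsing mode.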

This message was sent by Atlassian Jira
