lucene-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
Date Mon, 20 Jun 2016 13:49:05 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339513#comment-15339513 ]

Tim Allison commented on SOLR-7632:
-----------------------------------

Given the effort that [~thetaphi] and [~lewismc] just went through to upgrade to Tika 1.13,
I think we might want to pick up work on this issue again.

To carry out [~ehatcher]'s recommendation: I don't know whether we'd need CORS for this,
but it might be neat to modify Tika's server so that users can inject their own resources
(endpoints) via a config file and an extra jar.  Within the Solr project, we'd then just have
to implement a resource that takes an input stream, runs Tika and then adds a SolrInputDocument.

To keep things simple for users, it will take some effort on the Solr devs' side to figure
out how to start and stop at least one tika-server seamlessly, so that the "getting started"
user doesn't have to do a thing.

For scaling, one could imagine users configuring multiple tika-servers, and the handler randomly
selecting which tika-server to hit (I'm sure there are better strategies, but random selection
could get us started).
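
As a rough illustration of the random-selection idea above, here is a minimal sketch in Java.
The class name RandomTikaServerSelector and its methods are hypothetical and not part of Solr
or Tika; how the server URLs would actually be configured in the handler is left open.

```java
import java.util.List;
import java.util.Random;

// Hypothetical helper, not an existing Solr or Tika class.
// Picks one of the configured tika-server base URLs uniformly at random.
final class RandomTikaServerSelector {
    private final List<String> serverUrls;
    private final Random random;

    RandomTikaServerSelector(List<String> serverUrls, Random random) {
        if (serverUrls == null || serverUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one tika-server URL is required");
        }
        // Defensive copy so later changes to the caller's list have no effect here.
        this.serverUrls = List.copyOf(serverUrls);
        this.random = random;
    }

    // Returns the base URL of the server the next request should go to.
    String next() {
        return serverUrls.get(random.nextInt(serverUrls.size()));
    }
}
```

Random selection needs no shared state between requests, which is why it is an easy starting
point; it does not account for server load or for servers that are down.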

I'm more than happy to contribute on the Tika side and on some of the integration on the
Solr side.  Any takers among the Solr devs?

Overall, is this the right direction?  Is this worth the effort given the number of other
options for ETL into Solr?

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7632
>                 URL: https://issues.apache.org/jira/browse/SOLR-7632
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>              Labels: memex
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika fails it
> messes up the ExtractingRequestHandler (e.g., the document type caused Tika to fail, etc).
> A more reliable, decoupled, and easier-to-deploy version of the ExtractingRequestHandler
> would make a network call to the Tika JAXRS server, and then call Tika on the Solr server
> side, get the results and then index the information that way. I have a patch in the works
> from the DARPA Memex project and I hope to post it soon.
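
The network call described in the issue could look roughly like the sketch below. It is not
the patch mentioned above. It assumes a tika-server reachable at a base URL such as
http://localhost:9998 that accepts a PUT of the raw document on its /tika resource and
returns the extracted text, and it uses the JDK's java.net.http client (Java 11+), which is
an assumption about the runtime. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch, not an existing Solr or Tika class.
final class TikaServerClientSketch {

    // Builds a PUT request that sends the raw document bytes to <baseUrl>/tika
    // and asks for plain text back.
    static HttpRequest buildExtractRequest(String baseUrl, byte[] document) {
        String url = baseUrl.endsWith("/") ? baseUrl + "tika" : baseUrl + "/tika";
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the request and returns the response body; a non-200 status is
    // surfaced as an IOException so the caller can decide how to handle it.
    static String extractText(HttpClient client, String baseUrl, byte[] document)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildExtractRequest(baseUrl, document),
                        HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Because parsing happens in a separate process, a document that makes Tika fail would show up
here as an error response or a failed connection rather than as a failure inside Solr.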



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


