tika-dev mailing list archives

From Eric Pugh <ep...@opensourceconnections.com>
Subject Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
Date Thu, 12 Dec 2019 19:39:43 GMT
I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010

And a WIP PR is at https://github.com/apache/tika/pull/305

My thought is to put something together that mimics how we deploy Solr, and see how that works.
I need an install process that a general IT person can follow, someone who isn’t a Tika expert
or a Docker user.
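
As a rough illustration of the kind of check such an install process might run after a reboot,
here is a minimal Python sketch. It is not part of TIKA-3010 or the PR. It assumes Tika Server
is listening on its default port 9998 and relies on the server's GET /version endpoint; the
function name tika_server_version is made up for this example.

```python
import http.client
import urllib.error
import urllib.request


def tika_server_version(base_url="http://localhost:9998", timeout=5.0):
    """Return the version string reported by Tika Server, or None if it is unreachable.

    Hypothetical helper: base_url and timeout are assumptions, adjust for your install.
    """
    url = base_url.rstrip("/") + "/version"
    # Assumption: the server is on the local network, so bypass any configured HTTP proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            # Tika Server answers with a short plain-text version string.
            return resp.read().decode("utf-8", errors="replace").strip()
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or malformed response.
        return None
```

A monitoring job could call tika_server_version() and raise an alert whenever it returns None,
which is roughly the confidence the Solr service scripts give you today.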




> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org> wrote:
> 
> Thanks for bringing this conversation up Eric.
> 
> 
> 
> Historically, if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika server, whether they are individual devs, govvies, commercial folk and the like.
> 
> Big, small and medium projects. Evidenced by the expansion of Tika APIs into pretty much every PL I know of and actively use today.
> 
> 
> 
> Given that, we probably should update the main website docs to make this more prominent. The tika server docs on the wiki are pretty darn good. But they don’t get prime real estate. Would be wonderful if someone wants to update the website to make it more prominent.
> 
> 
> 
> The downstream Tika Python lib that I maintain has tons of activity, is used by 350+ projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633) from the 2014 DARPA MEMEX days was to move towards a Tika Server based SolrCell dep, and that’s the right way to go IMO.
> 
> 
> 
> Chris
> 
> From: Eric Pugh <epugh@opensourceconnections.com>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Wednesday, December 4, 2019 at 12:24 PM
> To: "tika-dev@apache.org" <tika-dev@apache.org>
> Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
> 
> 
> 
> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
> 
> 
> 
> Over in Solr land there has been renewed discussion about streamlining what Solr is....
  
> 
> 
> 
> With regard to rich content extraction and the Tika project, it seems like the two ideas that continue to preserve the existing behavior are:
> 
> 
> 
> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. This slims down the standard Solr download, and *might* make it easier to update the version of Tika + dependent jars used?
> 
> 
> 
> 2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call to Tika-Server.
> 
> 
> 
> 
> 
> I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams and work at major scale, and for them managing a fleet of Tika-Server instances on Kubernetes, with a load balancer dynamically scaling up and down, is simple and second nature! However, many organizations aren’t like that.
> 
> 
> 
> So I guess what I’m asking is: do we have a reasonable, supported approach for deploying Tika Server for non-Tika-savvy organizations? I’m thinking about Solr, and specifically the fact that Solr has a well-defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production I can feel confident that when the server is rebooted, Solr will come back up! Plus there is log rotation and all the rest.
> 
> 
> 
> In contrast, when I look at the Tika website, specifically the https://tika.apache.org/1.22/gettingstarted.html page, the message is to run Tika as a command line application, or embedded in your application.
> 
> 
> 
> I’m wondering if Tika-Server needs to be made more prominent, and treated as the “primary method of interacting with Tika”? Do we need as a community to focus more on Tika-Server, in our getting started documentation, in our usage documentation, and in our examples?
> 
> 
> 
> Do we need to create the equivalent of the Service Installation scripts for Tika-Server?
> 
> 
> 
> Wanted to stoke the discussion!
> 
> 
> 
> Eric
> 
> 
> 
> _______________________
> 
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
> 
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> 
> This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
<http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
 
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be Company Confidential
unless explicitly stated otherwise, regardless of whether attachments are marked as such.

