tika-dev mailing list archives

From Eric Pugh <ep...@opensourceconnections.com>
Subject Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
Date Mon, 16 Dec 2019 18:42:48 GMT
Hi folks!

I’ve got a mostly complete PR that adds install scripts for Tika Server, and I’m hoping
a committer will take a look at it and give feedback (and ideally commit it in time for
1.24!)

A couple of things:

1) This was completely influenced by https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script;
in fact, I started with the Solr scripts.

2) I’ve deleted all the Solr-specific aspects (I think); however, there may still be more
to delete.

3) This requires a change to how we release Tika. Until now we have shipped tika-app.jar,
tika-eval.jar, and tika-server.jar; now, I think, we want to add the tika-server-bin.tgz
and tika-server-bin.zip binary distributions.
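
To make the "comes back up after a reboot" goal concrete, here is a minimal sketch of the kind of systemd unit an install script could write out. It is not taken from the PR: the jar path, the service user, and the helper name `render_unit` are assumptions for illustration only. The `--host` and `--port` options and the default port 9998 are the usual Tika Server 1.x ones.

```python
from string import Template

# Hypothetical layout: the jar path and service user below are assumptions,
# not values used by the PR or by any official Tika package.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=Apache Tika Server
After=network.target

[Service]
User=$user
ExecStart=/usr/bin/java -jar $jar --host $host --port $port
Restart=on-failure

[Install]
WantedBy=multi-user.target
""")


def render_unit(jar="/opt/tika/tika-server.jar", user="tika",
                host="localhost", port=9998):
    """Return the text of a systemd unit that runs Tika Server as a service."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return UNIT_TEMPLATE.substitute(jar=jar, user=user, host=host, port=port)


if __name__ == "__main__":
    # An installer would write this to /etc/systemd/system/ and enable it.
    print(render_unit())
```

Once a unit like this is enabled, `WantedBy=multi-user.target` starts the service at boot and `Restart=on-failure` restarts it after a crash. Output goes to the journal, which handles rotation on most distributions.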

I’m happy to start writing accompanying “how to deploy Tika Server” docs if this PR
looks good!   Or, please give input and I’ll make the updates.

Eric


> On Dec 12, 2019, at 2:39 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
> 
> I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
> 
> And a WIP PR is at https://github.com/apache/tika/pull/305
> 
> My thought is to put something together that mimics how we deploy Solr, and see how that
> works.   I have a need for an install process that a general IT person can follow, who isn’t
> a Tika expert or a Docker user.
> 
> 
> 
> 
>> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org> wrote:
>> 
>> Thanks for bringing this conversation up Eric.
>> 
>> 
>> 
>> Historically, if you look over the last 5 years, I think what you are asking below
>> has sort of already become the de facto truth. Most people are in fact using Tika Server,
>> whether they are individual devs, govvies, commercial folk, and the like.
>> 
>> Big, small, and medium projects. Evidenced by the expansion of Tika APIs into pretty
>> much every PL I know of and use actively today.
>> 
>> 
>> 
>> Given that, we probably should update the main website docs to make this more prominent.
>> The Tika Server docs on the wiki are pretty darn good. But they don’t get prime real estate.
>> Would be wonderful if someone wants to update the website to make it more prominent.
>> 
>> 
>> 
>> The downstream Tika Python lib that I maintain has tons of activity, is used by 350+
>> projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having
>> created SOLR-7633) from the 2014 DARPA MEMEX days was to move towards a Tika Server based
>> SolrCell dep, and that’s the right way to go IMO.
>> 
>> 
>> 
>> Chris
>> 
>> From: Eric Pugh <epugh@opensourceconnections.com>
>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Date: Wednesday, December 4, 2019 at 12:24 PM
>> To: "tika-dev@apache.org" <tika-dev@apache.org>
>> Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika
>> Server in production?
>> 
>> 
>> 
>> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
>> 
>> 
>> 
>> Over in Solr land there has been renewed discussion about streamlining what Solr is....
>> 
>> 
>> 
>> With regard to rich content extraction and the Tika project, it seems like the two
>> ideas that continue to preserve the existing behavior are:
>> 
>> 
>> 
>> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.   This
>> slims down the standard Solr download, and *might* make it easier to update the version of
>> Tika + dependent jars used?
>> 
>> 
>> 
>> 2) The second approach is to instead require Tika-Server to be running
>> (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call
>> to Tika-Server.
>> 
>> 
>> 
>> 
>> 
>> I was thinking about why I like option 1 better than 2, and I think it boils down
>> to how mature the IT organization I am working with is.  Some IT organizations have large
>> dev-ops teams and are working at major scale, and for them managing a fleet of Tika-Server
>> instances on Kubernetes, behind a load balancer and dynamically scaling up and down, is
>> simple and second nature!  However, many organizations aren’t like that.
>> 
>> 
>> 
>> So I guess what I’m asking is: do we have a reasonable supported approach for deploying
>> Tika Server for non-Tika-savvy organizations?   I’m thinking about Solr, and specifically
>> the fact that Solr has a well-defined set of Service Installation scripts.   When I follow
>> the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
>> I can feel confident that when the server is rebooted, Solr will come back up!   Plus
>> there is log rotation and all the rest.
>> 
>> 
>> 
>> In contrast, when I look at the Tika website, specifically the
>> https://tika.apache.org/1.22/gettingstarted.htm page, the message is to run Tika
>> as a command line application, or embedded in your application.
>> 
>> 
>> 
>> I’m wondering if Tika-Server needs to be made more prominent, and treated as the
>> “primary method of interacting with Tika”?   Do we need as a community to focus more on
>> Tika-Server?   In our getting started documentation, in our usage documentation, and in our
>> examples?
>> 
>> 
>> 
>> Do we need to create the equivalent of the Service Installation scripts for Tika-Server?
>> 
>> 
>> 
>> Wanted to stoke the discussion!
>> 
>> 
>> 
>> Eric
>> 
>> 
>> 
>> _______________________
>> 
>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
>> | My Free/Busy <http://tinyurl.com/eric-cal>
>> 
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> 
>> This e-mail and all contents, including attachments, is considered to be Company
>> Confidential unless explicitly stated otherwise, regardless of whether attachments are marked
>> as such.
> 


