tika-dev mailing list archives

From Eric Pugh <ep...@opensourceconnections.com>
Subject Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
Date Tue, 17 Dec 2019 17:53:45 GMT
Cool.   

It’s the auto run that I really need, and the other part that I don’t think I’ve tackled
properly is the managing of logs…

I’m going to check with my project to see if they support Snap packages.

Eric


> On Dec 16, 2019, at 5:10 PM, Tom Barber <tom@spicule.co.uk> wrote:
> 
> Just saw this fly by. FYI, on Linux systems that support Snap packages (Ubuntu/Debian/Arch/Fedora etc.) you can `snap install tika-server`. It doesn’t yet auto-run, I don’t believe, but you can just run `tika-server.run`, and adding an init script wouldn’t take 5 minutes.
> 
> Tom
> 
> On 16 December 2019 at 18:42:55, Eric Pugh (epugh@opensourceconnections.com) wrote:
> 
>> Hi folks! 
>> 
>> I’ve got a mostly completed PR for having install scripts for Tika Server, and
I’m hoping a committer will take a look at the PR, and give feedback (and ideally commit
in time for 1.24!) 
>> 
>> A couple of things: 
>> 
>> 1) This was completely influenced by the Solr service installation script (https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script); in fact, I started with the Solr scripts.
>> 
>> 2) I’ve deleted all the Solr-specific aspects (I think); however, there may still be more to delete.
>> 
>> 3) This requires a change to how we release Tika. Previously we shipped tika-app.jar, tika-eval.jar, and tika-server.jar; now, I think, we want to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.
>> 
>> I’m happy to start writing accompanying “how to deploy Tika Server” docs if
this PR looks good! Or, please give input and I’ll make the updates.
>> 
>> Eric 
>> 
>> 
>> > On Dec 12, 2019, at 2:39 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
>> >  
>> > I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
>> >  
>> > And a WIP PR is at https://github.com/apache/tika/pull/305
>> >  
>> > My thought is to put something together that mimics how we deploy Solr, and see how that works. I need an install process that a general IT person, who isn’t a Tika expert or a Docker user, can follow.
>> >  
>> >  
>> >  
>> >  
>> >> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org> wrote:
>> >>  
>> >> Thanks for bringing this conversation up Eric. 
>> >>  
>> >>  
>> >>  
>> >> Historically, if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika Server, whether they are individual devs, govvies, commercial folk and the like, on big, small and medium projects. This is evidenced by the expansion of Tika APIs into pretty much every PL I know of and actively use today.
>> >>  
>> >>  
>> >>  
>> >> Given that, we probably should update the main website docs to make this more prominent. The Tika Server docs on the wiki are pretty darn good, but they don’t get prime real estate. It would be wonderful if someone wants to update the website to make them more prominent.
>> >>  
>> >>  
>> >>  
>> >> The downstream Tika Python lib that I maintain has tons of activity, is used by more than 350 projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633), from the 2014 DARPA MEMEX days, was to move towards a Tika Server-based SolrCell dep, and that’s the right way to go IMO.
>> >>  
>> >>  
>> >>  
>> >> Chris 
>> >>  
>> >> From: Eric Pugh <epugh@opensourceconnections.com>
>> >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> >> Date: Wednesday, December 4, 2019 at 12:24 PM
>> >> To: "tika-dev@apache.org" <tika-dev@apache.org>
>> >> Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
>> >>  
>> >>  
>> >>  
>> >> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!

>> >>  
>> >>  
>> >>  
>> >> Over in Solr land there has been renewed discussion about streamlining what
Solr is....  
>> >>  
>> >>  
>> >>  
>> >> In regards to rich content extraction and the Tika project, it seems like
the two ideas that continue to preserve the existing behavior are: 
>> >>  
>> >>  
>> >>  
>> >> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.
This slims down the standard Solr download, and *might* make it easier to update the version
of Tika + dependent jars used? 
>> >>  
>> >>  
>> >>  
>> >> 2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call to Tika-Server.
>> >>  
>> >>  
>> >>  
>> >>  
>> >>  
>> >> I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams and work at major scale; for them, managing a fleet of Tika-Server instances on Kubernetes, with a load balancer and dynamic scaling up and down, is simple and second nature! However, many organizations aren’t like that.
>> >>  
>> >>  
>> >>  
>> >> So I guess what I’m asking is: do we have a reasonable, supported approach for deploying Tika Server for non-Tika-savvy organizations? I’m thinking about Solr, and specifically the fact that Solr has a well-defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production I can feel confident that when the server is rebooted, Solr will come back up! Plus there is log rotation and all the rest.
>> >>  
>> >>  
>> >>  
>> >> In contrast, when I look at the Tika website, specifically the https://tika.apache.org/1.22/gettingstarted.htm page, the message is to run Tika as a command line application, or embedded in your application.
>> >>  
>> >>  
>> >>  
>> >> I’m wondering if Tika-Server needs to be made more prominent, and treated
as the “primary method of interacting with Tika”? Do we need as a community to focus more
on Tika-Server? In our getting started documentation, in our usage documentation, and in our
examples? 
>> >>  
>> >>  
>> >>  
>> >> Do we need to create the equivalent of the Service Installation scripts
for Tika-Server?  
>> >>  
>> >>  
>> >>  
>> >> Wanted to stoke the discussion! 
>> >>  
>> >>  
>> >>  
>> >> Eric 
>> >>  
>> >>  
>> >>  
>> >> _______________________ 
>> >>  
>> >> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
>> >>  
>> >> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> >>  
>> >> This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless of whether attachments
are marked as such. 
>> >  
>> >  
>> 
>> 
> 
> Spicule Limited is registered in England & Wales. Company Number: 09954122. Registered
office: First Floor, Telecom House, 125-135 Preston Road, Brighton, England, BN1 6AF. VAT
No. 251478891.
> 
> 
> 
> All engagements are subject to Spicule Terms and Conditions of Business. This email and
its contents are intended solely for the individual to whom it is addressed and may contain
information that is confidential, privileged or otherwise protected from disclosure, distributing
or copying. Any views or opinions presented in this email are solely those of the author and
do not necessarily represent those of Spicule Limited. The company accepts no liability for
any damage caused by any virus transmitted by this email. If you have received this message
in error, please notify us immediately by reply email before deleting it from your system.
Service of legal notice cannot be effected on Spicule Limited by email.
> 

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
<http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
 
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be Company Confidential
unless explicitly stated otherwise, regardless of whether attachments are marked as such.

