tika-dev mailing list archives

From Chris Mattmann <mattm...@apache.org>
Subject Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
Date Thu, 09 Jan 2020 05:08:33 GMT
+1

Note there is also a USC tika-dockers repo where I put the data science stuff too:

http://github.com/USCDataScience/tika-dockers

I’ll continue to push DL and ML Tika stuff there.

Cheers,

Chris

From: Dave Meikle <dmeikle@apache.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Wednesday, January 8, 2020 at 2:18 PM
To: "<dev@tika.apache.org>" <dev@tika.apache.org>
Subject: Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

 

Hi Eric,

 

Will take a look. On a related note, I've created a new repo:

https://github.com/apache/tika-docker

Based on the PRs and issues on LogicalSpark's docker-tikaserver, I'm thinking
I'll create an updated Dockerfile using what you've added here and look to
publish builds to Docker Hub from that.

 

What do you think?

 

Cheers,

Dave

 

 

 

On Wed, 8 Jan 2020 at 03:16, Eric Pugh <epugh@opensourceconnections.com> wrote:

Hi all, I’ve gone ahead and added the -spawnChild option as a default
when running Tika Server as a service. I’d love some eyes on the PR, and
if this looks good, get it committed.

 

Feedback welcome!

 

Eric
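
For readers unfamiliar with the flag mentioned in the message above: in Tika Server 1.x, `-spawnChild` asks the server to do its parsing in a forked child process that the parent can restart if it dies. The following is only a minimal sketch of how a service wrapper might assemble that command line. The jar path and the helper name `tika_server_command` are hypothetical and are not taken from the PR under discussion.

```python
import shlex


def tika_server_command(jar_path, port=9998, spawn_child=True):
    """Build the argv list for launching Tika Server 1.x (illustrative only)."""
    cmd = ["java", "-jar", jar_path, "--port", str(port)]
    if spawn_child:
        # -spawnChild: run the server work in a child process that the
        # parent process watches and restarts on failure.
        cmd.append("-spawnChild")
    return cmd


if __name__ == "__main__":
    # /opt/tika/tika-server.jar is an assumed install location.
    print(shlex.join(tika_server_command("/opt/tika/tika-server.jar")))
```

A service script would pass this list to whatever process supervisor it uses. The install scripts in the PR may well do this differently.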

 

 

 

> On Dec 17, 2019, at 12:53 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
>
> Cool.
>
> It’s the auto run that I really need, and the other part that I don’t think I’ve tackled properly is the managing of logs…
>
> I’m going to check with my project to see if they support Snap packages.

> 

> Eric

> 

> 

>> On Dec 16, 2019, at 5:10 PM, Tom Barber <tom@spicule.co.uk> wrote:
>>
>> Just saw this fly by and FYI: on Linux systems that support Snap packages (Ubuntu/Debian/Arch/Fedora etc.) you can `snap install tika-server`. It doesn’t yet auto-run, I don’t believe, but you can just run `tika-server.run`, and adding an init script wouldn’t take 5 minutes.

>> 

>> Tom

>> 

>> On 16 December 2019 at 18:42:55, Eric Pugh (epugh@opensourceconnections.com) wrote:

>> 

>>> Hi folks!

>>> 

>>> I’ve got a mostly completed PR for having install scripts for Tika Server, and I’m hoping a committer will take a look at the PR and give feedback (and ideally commit in time for 1.24!)
>>>
>>> A couple of things:
>>>
>>> 1) This was completely influenced by https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script. In fact, I started with the Solr scripts.
>>>
>>> 2) I’ve deleted all the Solr-specific aspects (I think); however, there may still be more to delete.
>>>
>>> 3) This requires a change to how we release Tika: previously we shipped tika-app.jar, tika-eval.jar, and tika-server.jar, and now, I think, we want to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.
>>>
>>> I’m happy to start writing accompanying “how to deploy Tika Server” docs if this PR looks good! Or, please give input and I’ll make the updates.

>>> 

>>> Eric

>>> 

>>> 

>>> > On Dec 12, 2019, at 2:39 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:
>>> >
>>> > I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
>>> >
>>> > And a WIP PR is at https://github.com/apache/tika/pull/305
>>> >
>>> > My thought is to put something together that mimics how we deploy Solr, and see how that works. I have a need for an install process that a general IT person can follow, who isn’t a Tika expert or a Docker user.

>>> >

>>> >

>>> >

>>> >

>>> >> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org> wrote:
>>> >>
>>> >> Thanks for bringing this conversation up, Eric.
>>> >>
>>> >> Historically, if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika Server, whether they are individual devs, govvies, commercial folk and the like. Big, small and medium projects. Evidenced by the expansion of Tika APIs into pretty much every PL I know of and actively use today.
>>> >>
>>> >> Given that, we probably should update the main website docs to make this more prominent. The Tika Server docs on the wiki are pretty darn good. But they don’t get prime real estate. Would be wonderful if someone wants to update the website to make it more prominent.
>>> >>
>>> >> The downstream Tika Python lib that I maintain has tons of activity, is used by 350+ projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633) from the 2014 DARPA MEMEX days was to move towards a Tika Server based SolrCell dep, and that’s the right way to go IMO.
>>> >>
>>> >> Chris
>>> >>

>>> >> From: Eric Pugh <epugh@opensourceconnections.com>
>>> >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>>> >> Date: Wednesday, December 4, 2019 at 12:24 PM
>>> >> To: "tika-dev@apache.org" <tika-dev@apache.org>
>>> >> Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

>>> >>

>>> >>

>>> >>

>>> >> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
>>> >>
>>> >> Over in Solr land there has been renewed discussion about streamlining what Solr is....
>>> >>
>>> >> In regards to rich content extraction and the Tika project, it seems like the two ideas that continue to preserve the existing behavior are:
>>> >>
>>> >> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. This slims down the standard Solr download, and *might* make it easier to update the version of Tika + dependent jars used?
>>> >>
>>> >> 2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call to Tika-Server.
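
To make option 2 above concrete, here is a minimal sketch of what "delegating the call" could look like from a client's point of view. It assumes a Tika Server listening on its default port 9998 and uses the server's `PUT /tika` endpoint to request plain text. The function names and the base URL default are illustrative assumptions, and this is not how Solr or SOLR-7633 actually implements the delegation.

```python
import urllib.request


def build_extract_request(document_bytes, base_url="http://localhost:9998"):
    """Prepare a PUT /tika request asking Tika Server for plain-text output."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document_bytes, base_url="http://localhost:9998", timeout=60):
    """Send the document to a running Tika Server and return the extracted text."""
    request = build_extract_request(document_bytes, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text(open("report.pdf", "rb").read())` would require a reachable server. The point of the sketch is that the caller needs no Tika jars of its own, only HTTP access, which is why the availability of that server becomes the operational question raised in the rest of the message.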

>>> >>

>>> >>

>>> >>

>>> >>

>>> >>

>>> >> I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams and are working at major scale, and managing a fleet of Tika-Server instances on Kubernetes with a load balancer, dynamically scaling up and down, is simple and second nature! However, many organizations aren’t like that.
>>> >>
>>> >> So I guess what I’m asking is: do we have a reasonable supported approach for deploying Tika Server for non-Tika-savvy organizations? I’m thinking about Solr, and specifically the fact that Solr has a well-defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production I can feel confident that when the server is rebooted, Solr will come back up! Plus there is log rotation and all the rest.
>>> >>
>>> >> In contrast, when I look at the Tika website, specifically the https://tika.apache.org/1.22/gettingstarted.html page, the message is to run Tika as a command line application, or embedded in your application.
>>> >>
>>> >> I’m wondering if Tika-Server needs to be made more prominent, and treated as the “primary method of interacting with Tika”? Do we need as a community to focus more on Tika-Server? In our getting started documentation, in our usage documentation, and in our examples?
>>> >>
>>> >> Do we need to create the equivalent of the Service Installation scripts for Tika-Server?

>>> >>

>>> >>

>>> >>

>>> >> Wanted to stoke the discussion!

>>> >>

>>> >>

>>> >>

>>> >> Eric

>>> >>

>>> >>

>>> >>

>>> >> _______________________
>>> >>
>>> >> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
>>> >>
>>> >> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>> >>
>>> >> This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.

>>> >


>>> >

>>> 


>>> 

>> 

>> Spicule Limited is registered in England & Wales. Company Number: 09954122. Registered office: First Floor, Telecom House, 125-135 Preston Road, Brighton, England, BN1 6AF. VAT No. 251478891.
>>
>> All engagements are subject to Spicule Terms and Conditions of Business. This email and its contents are intended solely for the individual to whom it is addressed and may contain information that is confidential, privileged or otherwise protected from disclosure, distributing or copying. Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of Spicule Limited. The company accepts no liability for any damage caused by any virus transmitted by this email. If you have received this message in error, please notify us immediately by reply email before deleting it from your system. Service of legal notice cannot be effected on Spicule Limited by email.

>> 

> 


> 

 


 

 

 

