tika-dev mailing list archives

From Dave Meikle <dmei...@apache.org>
Subject Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
Date Wed, 08 Jan 2020 22:17:45 GMT
Hi Eric,

Will take a look. On a related note, I've created a new repo:
https://github.com/apache/tika-docker

Based on the PRs and issues on LogicalSpark's docker-tikaserver, I'm
thinking I'll create an updated Dockerfile using what you've added here
and look to publish builds to Docker Hub from that.

What do you think?

Cheers,
Dave



On Wed, 8 Jan 2020 at 03:16, Eric Pugh <epugh@opensourceconnections.com>
wrote:

> Hi all, I’ve gone ahead and added the -spawnChild property as a default
> when running Tika Server as a service. I’d love some eyes on the PR, and
> if this looks good, get it committed.
>
> Feedback welcome!
>
> Eric
>
>
>
> > On Dec 17, 2019, at 12:53 PM, Eric Pugh <epugh@opensourceconnections.com>
> wrote:
> >
> > Cool.
> >
> > It’s the auto run that I really need, and the other part that I don’t
> think I’ve tackled properly is the managing of logs…
> >
> > I’m going to check with my project to see if they support Snap packages.
> >
> > Eric
> >
> >
> >> On Dec 16, 2019, at 5:10 PM, Tom Barber <tom@spicule.co.uk> wrote:
> >>
> >> Just saw this fly by. FYI, on Linux systems that support Snap
> packages (Ubuntu/Debian/Arch/Fedora etc.) you can `snap install tika-server`.
> It doesn’t yet auto-run, I don’t believe, but you can just run `tika-server.run`,
> and adding an init script wouldn’t take 5 minutes.
> >>
> >> Tom
> >>
> >> On 16 December 2019 at 18:42:55, Eric Pugh (
> epugh@opensourceconnections.com) wrote:
> >>
> >>> Hi folks!
> >>>
> >>> I’ve got a mostly completed PR for having install scripts for Tika
> Server, and I’m hoping a committer will take a look at the PR, and give
> feedback (and ideally commit in time for 1.24!)
> >>>
> >>> A couple of things:
> >>>
> >>> 1) This was completely influenced by
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script,
> in fact I started with the Solr scripts.
> >>>
> >>> 2) I’ve deleted all the Solr specific aspects (I think), however there
> may still be more to delete.
> >>>
> >>> 3) This requires a change to how we release Tika: previously we shipped
> tika-app.jar, tika-eval.jar, and tika-server.jar, and now, I think, we
> want to add the tika-server-bin.tgz and tika-server-bin.zip binary
> distributions.
> >>>
> >>> I’m happy to start writing accompanying “how to deploy Tika Server”
> docs if this PR looks good! Or, please give input and I’ll make the updates.
> >>>
> >>> Eric
> >>>
> >>>
> >>> > On Dec 12, 2019, at 2:39 PM, Eric Pugh <
> epugh@opensourceconnections.com> wrote:
> >>> >
> >>> > I’ve created this JIRA to track this work:
> https://issues.apache.org/jira/browse/TIKA-3010
> >>> >
> >>> > And a WIP PR is at https://github.com/apache/tika/pull/305
> >>> >
> >>> > My thought is to put something together that mimics how we deploy
> Solr, and see how that works. I have a need for an install process that a
> general IT person can follow, who isn’t a Tika expert or a Docker user.
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattmann@apache.org>
> wrote:
> >>> >>
> >>> >> Thanks for bringing this conversation up Eric.
> >>> >>
> >>> >>
> >>> >>
> >>> >> Historically if you look over the last 5 years, I think what you
> are asking below has sort of already become the de facto
> >>> >> truth. Most people are in fact using Tika server, whether they are
> individual devs, govvies, commercial folk and the like.
> >>> >>
> >>> >> Big, small and medium projects. Evidenced by the expansion of Tika
> APIs into pretty much every PL I know of and
> >>> >> actively use today.
> >>> >>
> >>> >>
> >>> >>
> >>> >> Given that, we probably should update the main website docs to make
> this more prominent. The tika server docs on the
> >>> >> wiki are pretty darn good. But they don’t get prime real estate.
> Would be wonderful if someone wants to update the
> >>> >> website to make it more prominent.
> >>> >>
> >>> >>
> >>> >>
> >>> >> The downstream Tika Python lib that I maintain has tons of activity,
> is used by 350+ projects, and relies solely
> >>> >> on Tika-Server. My recommendation to the Solr folks (having created
> SOLR-7633) from the 2014 DARPA MEMEX days was to
> >>> >> move towards a Tika Server based SolrCell dep, and that’s the right
> way to go IMO.
> >>> >>
> >>> >>
> >>> >>
> >>> >> Chris
> >>> >>
> >>> >>
> >>> >> From: Eric Pugh <epugh@opensourceconnections.com>
> >>> >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> >>> >> Date: Wednesday, December 4, 2019 at 12:24 PM
> >>> >> To: "tika-dev@apache.org" <tika-dev@apache.org>
> >>> >> Subject: [EXTERNAL] Do we have a community supported approach for
> deploying Tika Server in production?
> >>> >>
> >>> >>
> >>> >>
> >>> >> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user
> question!
> >>> >>
> >>> >>
> >>> >>
> >>> >> Over in Solr land there has been renewed discussion about
> streamlining what Solr is....
> >>> >>
> >>> >>
> >>> >>
> >>> >> In regards to rich content extraction and the Tika project, it
> seems like the two ideas that continue to preserve the existing behavior
> are:
> >>> >>
> >>> >>
> >>> >>
> >>> >> 1) To convert the ExtractingRequestHandler into a Package (Plugin)
> for Solr. This slims down the standard Solr download, and *might* make it
> easier to update the version of Tika + dependent jars used?
> >>> >>
> >>> >>
> >>> >>
> >>> >> 2) The second approach is to instead require Tika-Server to be
> running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr
> delegate the call to Tika-Server.
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> I was thinking about why I like option 1 better than 2, and I think
> it boils down to how mature the IT organization I am working with is. Some
> IT organizations have large dev-ops teams, and are working at major scale,
> and managing a fleet of Tika-Server on Kubernetes with Load Balancer
> dynamically scaling up and down is simple and second nature! However, many
> organizations aren’t like that.
> >>> >>
> >>> >>
> >>> >>
> >>> >> So I guess what I’m asking is do we have a reasonable supported
> approach for deploying Tika Server for non-tika savvy organizations? I’m
> thinking about Solr, and specifically the fact that Solr has a well defined
> set of Service Installation scripts. When I follow the directions in
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
> I can feel confident that when the server is rebooted, then Solr will come
> back up! Plus there is log rotation and all the rest.
> >>> >>
> >>> >>
> >>> >>
> >>> >> In contrast, when I look at the Tika website, specifically the
> https://tika.apache.org/1.22/gettingstarted.htm page, the message is
> to run Tika as a command line application, or embedded in your
> application.
> >>> >>
> >>> >>
> >>> >>
> >>> >> I’m wondering if Tika-Server needs to be made more prominent, and
> treated as the “primary method of interacting with Tika”? Do we need as a
> community to focus more on Tika-Server? In our getting started
> documentation, in our usage documentation, and in our examples?
> >>> >>
> >>> >>
> >>> >>
> >>> >> Do we need to create the equivalent of the Service Installation
> scripts for Tika-Server?
> >>> >>
> >>> >>
> >>> >>
> >>> >> Wanted to stoke the discussion!
> >>> >>
> >>> >>
> >>> >>
> >>> >> Eric
> >>> >>
> >>> >>
> >>> >>
> >>> >> _______________________
> >>> >>
> >>> >> Eric Pugh | Founder & CEO | OpenSource Connections, LLC |
> 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <
> http://tinyurl.com/eric-cal>
> >>> >>
> >>> >> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
> >>> >>
> >>> >> This e-mail and all contents, including attachments, is considered
> to be Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
> >>> >
> >>> >
> >>>
> >>>
> >>
> >> Spicule Limited is registered in England & Wales. Company Number:
> 09954122. Registered office: First Floor, Telecom House, 125-135 Preston
> Road, Brighton, England, BN1 6AF. VAT No. 251478891.
> >>
> >>
> >>
> >> All engagements are subject to Spicule Terms and Conditions of
> Business. This email and its contents are intended solely for the
> individual to whom it is addressed and may contain information that is
> confidential, privileged or otherwise protected from disclosure,
> distributing or copying. Any views or opinions presented in this email are
> solely those of the author and do not necessarily represent those of
> Spicule Limited. The company accepts no liability for any damage caused by
> any virus transmitted by this email. If you have received this message in
> error, please notify us immediately by reply email before deleting it from
> your system. Service of legal notice cannot be effected on Spicule Limited
> by email.
> >>
> >
> >
>
>
>
