tika-dev mailing list archives

From Oleg Tikhonov <o...@apache.org>
Subject Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks
Date Thu, 06 Sep 2018 18:20:45 GMT
Ideally, tika-server is dockerized and runs on swarm as a service. In
addition, it has a healthcheck mechanism, say something like an HTTP GET
request that returns a 200 status code. Docker runs this health check
periodically, and if it fails, it restarts tika-server.
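Roughly, the swarm side could look like the sketch below (the image name,
port, and /version endpoint are assumptions here -- adjust to whatever the
deployment actually uses):

    # Sketch only: image name, port and /version endpoint are assumptions,
    # and curl must be available inside the image.
    # Swarm runs the health check periodically and replaces the task when
    # it is marked unhealthy.
    docker service create \
      --name tika-server \
      --replicas 2 \
      --publish 9998:9998 \
      --health-cmd 'curl -sf http://localhost:9998/version || exit 1' \
      --health-interval 30s \
      --health-timeout 10s \
      --health-retries 3 \
      apache/tika-server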
However, we are far from that. Two ways to go, from my point of view:
1. your second option, or 2. an OS daemon that checks tika-server's
availability, or something like that. On Linux we can use cron to run our
"healthcheck" and, if it detects anomalies, restart the server. For Windows
we can probably find a similar mechanism as well.
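Something along these lines could be the cron-driven check on Linux (the
port, endpoint, and systemd unit name are placeholders):

    #!/bin/sh
    # healthcheck.sh -- sketch only; port, endpoint and service name are
    # placeholders.
    # Crontab entry, e.g.:  */5 * * * * /opt/tika/healthcheck.sh
    if ! curl -sf --max-time 30 http://localhost:9998/version > /dev/null; then
        # No (timely) answer from tika-server: restart it.
        systemctl restart tika-server
    fi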


On Thu, Sep 6, 2018, 18:29 Tim Allison (JIRA) <jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605925#comment-16605925
> ]
>
> Tim Allison commented on TIKA-2725:
> -----------------------------------
>
> bq. What is tika-server's typical env? Stand-alone, distributed ... like
> replicas in a cluster?
>
> It varies, I'm sure.  Not sure what the most common use case is.  I would
> hope distributed -- swarm or similar.
>
> bq. Are there any time limitations for recovery?
>
> I think whoever starts the server should be able to set the threshold for
> timeouts per file...although I may misunderstand your question.
>
> bq.  How do we know what point to start processing from?
> That wouldn't be tika-server's problem.  Clients calling tika-server would
> get an error message, or potentially no response within a socket/http
> timeout range.  They should not reprocess those docs.
>
> bq. Do we mark documents which were processed?
> Same as above, that's a client concern.
>
> bq. For example, if tika-server had run on Docker swarm/K8S then the
> orchestrator would have restarted a failed replica itself
> To confirm that I understand this correctly, currently, if the tika-server
> process dies, swarm/k8s will automatically restart it?  That's good to
> hear.  However, we don't currently have the watcher thread within
> tika-server to kill its own process on oom/timeout...so as it is now, it
> would have to be something catastrophic taking down tika-server (operating
> system, perhaps?).
>
>
>
>
> > Make tika-server robust against ooms/infinite loops/memory leaks
> > ----------------------------------------------------------------
> >
> >                 Key: TIKA-2725
> >                 URL: https://issues.apache.org/jira/browse/TIKA-2725
> >             Project: Tika
> >          Issue Type: Task
> >            Reporter: Tim Allison
> >            Assignee: Tim Allison
> >            Priority: Major
> >
> > Currently, tika-server is vulnerable to OOMs, infinite loops and memory
> leaks.  I see two ways of making it robust:
> > 1) use the ForkParser
> > 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
> > I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
> > Other options/recommendations?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
